SYNOPSIS
of the 

Ph. D. Dissertation 
on 

NEW ANALYTIC MODELS FOR MULTIPROCESSORS
WITH VARIOUS INTERCONNECTION STRUCTURES

by 

Adarshpal Singh Sethi 

Computer Science Programme 
Indian Institute of Technology, Kanpur

August 1977 

With the rapid evolution of computer technology has come the 
need to configure cheap computer systems with large processing power 
and high reliability. One method of achieving these goals has been 
to exploit the parallelism of multiprocessor systems. In recent years, 
an increasing number of multiprocessors have been designed and/or built, 
such as C.mmp at Carnegie-Mellon University, the TDC-316 multiprocessor
at the Tata Institute of Fundamental Research, Bombay, and the BBN
Pluribus Interface Message Processor for the ARPA Network. This activity
has spurred work on computer modelling to analyse the performance of such
systems at the level of the processor-memory interface.

Work on performance evaluation, however, still lags far behind
the advances in multiprocessor technology. The performance of a
multiprocessor system is crucially dependent on the interconnection
mechanism used for communication between the functional units. Hence the
prime effort of ongoing research in this area is to devise better and more
efficient interconnection schemes. The system designer must then have
adequate tools available to evaluate and compare the performance of
multiprocessors which use these various schemes. It is the problem of
devising such evaluation techniques that we address in this thesis.

We start the thesis by describing some of the important interconnection
schemes being used in multiprocessors. These are the crossbar switch,
time-shared bus, multiport memory/multibus, and a hybrid interconnection
scheme used in the Pluribus multiprocessor built by BBN for the ARPA
Network. Salient characteristics as well as the advantages and
disadvantages of these schemes are discussed.

We next give a summary of the existing work on analytic models for
crossbar switch multiprocessors. Most of the past research on this
topic has assumed that the memory references of each processor are
uniformly distributed among all the memory modules. Although this
assumption considerably simplifies the analysis, it is not realistic, since
programs generally exhibit the property of locality of references.

The first new result in this thesis is the development of a model
for crossbar switch multiprocessors with local referencing, which
reflects more closely the behavior of real systems. This model is
analysed using both discrete and continuous Markov chain techniques, and
expressions are derived for the multiprocessor performance. New
expressions are also obtained for the performance in the traditional
uniform reference model and are compared with other expressions available
in the literature. Results of a simulation study are presented to
demonstrate the accuracy of the expressions for both models.

Almost all the work to date on computer modelling for analysing
the performance of multiprocessor systems has been limited to the study
of systems using a crossbar switch as the interconnection medium. As
mentioned earlier, the tools of analytic modelling need to be improved
to keep pace with the innovative development of new interconnection
schemes. One of the main contributions of this thesis is the construction
of analytic models for multiprocessors using the time-shared bus
and the hybrid Pluribus scheme as the interconnection structures.

A discrete Markov chain model for time-shared bus multiprocessors
is described. An example is given to explain the detailed analysis 
technique and simulation results are presented to verify the results 
of the analysis. 

Next, a model for evaluating the performance of the Pluribus
multiprocessor is described. The Pluribus system is a hybrid containing
a crossbar switch and a number of time-shared buses. The analytic
model described here breaks the system into its crossbar switch and
time-shared bus components, while taking into account the
complex interaction between these components. The crossbar switch is
then analysed in terms of an existing model while the time-shared bus
component is analysed using the model developed earlier. These results
are synthesized to give the performance of the whole system. Graphical
results are presented to show the effect of the various parameters of
the system on its performance. Simulation results are presented to
validate the model.

Finally, some suggestions are made for further work in this area.



CHAPTER 1 


INTRODUCTION

With the rapid evolution of computer technology has come the
need to construct computer systems which will solve larger problems
in less time with higher reliability. Parallel processing represents
one of the more effective ways of achieving these goals. The initial
development of systems incorporating parallelism was almost entirely
motivated by considerations of reliability. Thus redundancy was
provided at various levels of the system to cater for catastrophic failure
situations.

It gradually came to be realized, however, that redundant components 
could actually be put to use to improve the performance of the system. 

If some of the resources fail, dynamic re-allocation of the remaining 
resources then results in graceful degradation or "fail-soft" operation. 

Another advantage of several parallel systems is their flexibility.
Flexibility, in the words of Searle and Freberg [Sea 75], "is a measure
of the ease with which a system configuration can be altered." For a
system to be truly flexible, both its hardware and software should be
capable of easy alteration. A better understanding of operating systems
for large parallel systems has emerged recently, thus allowing them to be
made more flexible. In fact, availability of better software is one of
the factors responsible for the increasing popularity of such systems.

However, the greatest potential benefit of parallel systems lies in
their performance capabilities. Electronic technology appears to be
approaching limits imposed by electrical propagation delays. Parallel
processing offers an attractive means of overcoming this problem.
Plummeting hardware costs have also improved the cost/performance viability
of these systems. The future is thus certainly going to witness an
increasing exploitation of concurrency and parallelism at all possible
levels.

1.1 Types of Parallelism

The term "parallel processing" encompasses in its scope a wide
variety of computer systems. One classification has been given by
Flynn [Fly 66, Fly 72b], who divides computer systems into four categories:

(a) Single Instruction Single Data (SISD),

(b) Single Instruction Multiple Data (SIMD),

(c) Multiple Instruction Single Data (MISD), and

(d) Multiple Instruction Multiple Data (MIMD).

SISD covers the usual uniprocessor computers. Associative processors,
processing ensembles, and array processors, such as the ILLIAC IV, fall
in the SIMD category. Pipeline processors may be considered to belong to
any of the SIMD, MISD, or MIMD architectures. Multiprocessors and
multicomputer systems belong to the MIMD class.

1.2 Multiprocessors

Multiprocessors, the subject of this thesis, as distinct from
multiple-computer systems, are not easy to define. The difference between
a multiple-computer system and a multiprocessor is in the extent and degree
of sharing: whereas the former consists of several separate and discrete
computers, the latter is a single computer with multiple processing units.
Enslow [Ens 74, Ens 77] defines a multiprocessor as a system with
the following characteristics:

(a) It contains two or more processing units of approximately 
comparable capabilities; 

(b) All processors share access to a common memory (although
some private memory may be allowed);

(c) All processors share access to input/output channels,
control units, and devices;

(d) There is a single integrated operating system in overall
control of all hardware and software; and

(e) There must be intimate interaction possible at both hardware
and software operating levels.

Figure 1 depicts the basic structure of a multiprocessor system. 

Thus a multiprocessor has capabilities for the sharing of memory and
input/output devices by all processors; the input/output devices also
have complete access to memory. Hence the interconnection system has
to support three types of communication: processor-memory,
processor-I/O, and memory-I/O.

Although multiprocessors have all three advantages of reliability,
flexibility, and higher performance mentioned earlier, they pose a number
of problems not encountered in single-processor systems. A multiprocessor
system must have special facilities, both hardware and software, to resolve
contention for shared resources. The operating system is larger and more
complex than for uniprocessors. To properly exploit the available
parallelism, tasks need to be divided into subtasks which can be executed in
parallel. This makes scheduling considerably more complicated.




FIGURE 1 : Structure of a Multiprocessor System

At the hardware level, proper mechanisms have to be provided for
communication between the various functional units. The interconnection
system must have high bandwidth and must be reliable. A processor should
have the capability of interrupting other processors. Efficient
failure detection is important for high reliability and both manual and
automatic reconfiguration should be possible.

Performance monitoring and evaluation of multiprocessors are not only
more complex than for uniprocessors, they are also more important. For a
long time there had been an impression that multiprocessors were not
capable of high performance and that they were not cost-effective. With
better evaluation techniques, these mistaken notions are now being
dispelled.

Work on performance evaluation, however, still lags behind the advances
in multiprocessor technology. The performance of a multiprocessor system
is crucially dependent on the interconnection mechanism used for
communication between the functional units. Hence the prime effort of ongoing
research in this area is to devise better and more efficient interconnection
schemes. The system designer must then have adequate tools available to
evaluate and compare the performance of multiprocessors
which use these various schemes. It is the problem of devising such
evaluation techniques that we address in this thesis.

1.3 Overview of the Thesis

We start in Chapter 2 by describing some of the important
interconnection schemes used in multiprocessors, namely, the crossbar switch,
time-shared bus, multiport memory/multibus, and a hybrid interconnection
scheme used in the Pluribus multiprocessor built by Bolt, Beranek, and
Newman, Inc. (BBN), Cambridge, Mass., for the ARPA Network. Salient
characteristics as well as the advantages and disadvantages of these
schemes are discussed.

We begin Chapter 3 with a summary of existing work on analytic
models for crossbar switch multiprocessors. Most of the past research
on this topic has assumed that the memory references of each processor
are uniformly distributed among all the memory modules. Although this
assumption considerably simplifies the analysis, it is not very realistic,
since programs generally exhibit the property of locality of references.

In Chapter 3, we develop a model for crossbar switch multiprocessors
with local referencing, which reflects more closely the behavior of real
systems. This model is analysed using both discrete and continuous Markov
chain techniques, and expressions are derived for the multiprocessor
performance. New expressions are also obtained for the performance in the
traditional uniform reference model and are compared with other expressions
available in the literature. Results of a simulation study are given to
show the accuracy of the expressions for both models.

Almost all the work to date on computer modelling to analyse the
performance of multiprocessor systems has been limited to the study of
systems using a crossbar switch as the interconnection medium. As
mentioned in the previous section, the tools of analytic modelling need to
be improved to keep pace with the innovative development of new
interconnection schemes. Keeping this aim in view, one of the main
contributions of this thesis is the construction of analytic models for
multiprocessors using the time-shared bus and the hybrid Pluribus scheme
as the interconnection structures.

Chapter 4 describes a discrete Markov chain model for time-shared bus
multiprocessors. An example is given to explain the detailed analysis
technique and simulation results are presented to verify the results of
the analysis.

A model for evaluating the performance of the Pluribus multiprocessor
is described in Chapter 5. The Pluribus is a hybrid system containing a
crossbar switch and a number of time-shared buses. The analytic model
described there decomposes the system into its crossbar switch and
time-shared bus components, while taking into account the complex
interaction between these components. The crossbar switch is then
analysed in terms of an existing model while the model of Chapter 4 is used to
analyse the time-shared bus component. These results are synthesized to
give the performance of the whole system. Graphical results are presented
to show the effect of the various parameters of the system on its
performance. Simulation results are given to verify the validity of the model.

Chapter 6 presents the conclusions and suggests the directions that
future work in this area may take.



CHAPTER 2 


MULTIPROCESSOR INTERCONNECTION STRUCTURES

Of paramount importance in a multiprocessor system is the
communication mechanism and the mode of interconnection between its functional
units, namely, the processors, memory modules, and the input/output units.
Sharing of memory modules between multiple processors and I/O units results
in conflicts between units desiring to access the same memory module at the
same time. This phenomenon is called memory interference and it is the
primary cause of degradation in multiprocessor performance. Thus the
main task of an analytic model for a multiprocessor is to estimate the
amount of memory interference in the system.

In the models considered in this thesis, the effect of input/output
units will not be modelled explicitly. This is because, in most cases,
their effect on the overall performance of the system is insignificant
[Str 70]. For example, transferring with four drums or 15 fixed head
disks at full rate is comparable to the activity of one processor [Bel 71].

Some of the important interconnection media used in multiprocessors
are the crossbar switch, time-shared bus, multiport memory/multibus, and
a hybrid interconnection scheme used in the BBN Pluribus multiprocessor.
There are a number of good papers which focus on multiprocessor
interconnections [And 75, Bae 76, Ens 74, Ens 77, Per 73, Sea 75, Swa 76].
In this chapter, we shall discuss the salient features as well as some
of the advantages and disadvantages of these interconnection schemes.

This discussion, however, constitutes only one way of interpreting these
various schemes. There are other ways of viewing these structures, as is
obvious from the references cited above. Swan et al. [Swa 76], for instance,
regard these interconnection structures as mere variants of one fundamental
structure, namely, the crossbar switch.

2.1 Crossbar Switch

In the crossbar switch organisation, shown in Figure 2, any memory
module can be connected to any processor. A full-time connection is
established between the two units for the complete duration of the
transfer. In the absence of a conflict, multiple connections are possible at
a time. Such an organization is characterized by high bandwidth; in fact,
among all the interconnection schemes, the crossbar switch has the
potential for the highest total system transfer rate.

Since all the circuitry for conflict resolution is incorporated in
the switch itself, the control logic of the memory modules is very simple.
If the switch is distributed, the system can be made modular as well as
reliable, and additional processors and/or memory modules can be added
without too much difficulty.

However, the crossbar switch is extremely complicated and costly.
It has been estimated that the cost of a switch for a large system is
comparable to the cost of a few processors [Ens 74].

An important example of a multiprocessor system using a crossbar
switch is C.mmp, the multi-mini-processor built at Carnegie-Mellon University
[Bel 71, Wul 72]. The TDC-316 multiprocessor under construction at the
Tata Institute of Fundamental Research, Bombay [Jos 76, May 76], also
employs a crossbar switch.



FIGURE 2 : Crossbar Switch Configuration.

FIGURE 3 : Time-Shared Bus Configuration

2.2 Time-shared Bus

The time-shared bus system, shown in Figure 3, is one of the simplest
and cheapest interconnection schemes. Here there are no continuous
connections between functional units. All the units are connected in parallel
to the bus, which allows communication between one pair of units at a time.
Since there is only one transfer path for all transfers, the system
bandwidth and efficiency are low and the reliability is poor. It is very easy
to physically modify the system configuration by adding or removing
functional units. However, system expansion results in considerable
degradation of the overall system performance.

For these reasons, the use of the time-shared bus is limited and it
is not generally used for high-performance multiprocessors. Examples of
systems using a time-shared bus are the PDP-11 and the Lockheed SUE. A
good description of the operation of a bus may be found in [Thu 72] and
in [&an 71].

2.3 Multiport Memory/Multibus

The multiport memory/multibus system, shown in Figure 4, tries to
overcome some of the disadvantages of the single time-shared bus. It is
also less costly than the crossbar switch. For achieving high
bandwidth, each processor may be assigned a dedicated bus, although this is
poor from the point of view of reliability. If a bus bottleneck is
present, expanding the number of buses increases the system throughput.

However, it is essential, in this organization, for the memory
modules to have a number of access ports, which makes the control logic
of the memories more complex. The cabling and connector costs are large
and become more pronounced as the system expands. The maximum
configuration possible is also limited by the number of ports available on a
memory module.

FIGURE 4 : Multiport Memory/Multibus Configuration

The UNIVAC 1108 and the IBM System 360 Model 67 are examples of such
systems.

2.4 Pluribus Organization

This interesting unconventional scheme has been used by BBN for their
Pluribus multiprocessor to be used as an Interface Message Processor for
the ARPA Network [Bar 75, Hea 73, Orn 75]. This is a hybrid between the
crossbar switch and the time-shared bus systems and is shown in Figure 5.

In this system, there are a number of processor buses, each containing
a few processors and a few local memories connected by a time-shared
bus. The local memories on a processor bus are accessible only to the
processors on that bus. There are also additional memory buses which
contain only memory modules. These memories are global and are accessible to
any processor via the crossbar switch. In the system built by BBN, there
are 7 processor buses each having 2 processors and 2 local memories, and
2 memory buses each having 2 global memories. The crossbar switch is
distributed and is implemented by interconnecting units called bus couplers
on various buses.

The Pluribus system is cheaper than the pure crossbar switch scheme
and does not have the bandwidth limitations of the single bus scheme.
However, for maximum efficiency, the bandwidths of the various components
of the system must be carefully matched.

FIGURE 5 : Pluribus Configuration.

2.5 Definitions and Notations

A mathematical model of a system may be constructed at various levels
of abstraction [Bha 76]. The interconnection mechanism in a multiprocessor
forms the interface between the processors and the memory modules. This
interface has a significant effect on the multiprocessor performance; for
this reason, the models considered in this thesis are all at the level
of the processor-memory interface.

At this level, the interaction between the processors and memories
must be very precisely defined. In general, processor behavior varies
for different instructions. However, we shall not explicitly model these
differences in instructions. We shall also make no distinction between
the processing needed to decode an instruction and the processing
corresponding to its execution. Instead we shall use the concept of a unit
instruction, first proposed by Strecker [Str 70], which simply models
the fetching of a word from memory followed by the processing of the word
by a processor.

A diagrammatic representation of a unit instruction is shown in
Figure 6(a). In this figure, $t_p'$ represents the processing time of the
processor, $t_c$ is the memory cycle time, $t_a$ the memory access time and
$t_w$ the memory rewrite time.

FIGURE 6 : Diagrammatic representation of a unit instruction.

In most cases of interest to us, $t_p'$ will generally be greater than
or equal to $t_w$. In these cases, to facilitate our discussion, a
transformation may be made on the diagram of Figure 6(a) to give Figure 6(b).
The memory now has an access time of $t_c$ and zero rewrite time; the new
processing time is $t_p = t_p' - t_w$. This transformation introduces no change
in the sense that the performance of the system in either case is the
same. In both cases, the memory access begins at point 1, the memory
is ready to service a new request at point 2, and the processor
execution is completed at point 3. (This transformation was first used by
Strecker [Str 70].) Thus, in all our models, without loss of generality,
we shall assume the memory access time to be equal to the cycle time,
and the rewrite time to be zero.
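The equivalence of the two diagrams can be checked with a small timing sketch. It assumes, as points 1, 2 and 3 of Figure 6 suggest, that the processor starts working on the word as soon as the access time has elapsed while the memory rewrites in parallel, and that $t_c = t_a + t_w$. The function names and the numerical values are illustrative only and do not come from the thesis.

```python
def transform_unit_instruction(t_a, t_w, t_p_prime):
    """Return (access, rewrite, processing) times of the transformed diagram.

    Assumes t_p' >= t_w, the case considered in the text.
    """
    if t_p_prime < t_w:
        raise ValueError("the transformation assumes t_p' >= t_w")
    t_c = t_a + t_w                    # cycle time = access + rewrite
    return t_c, 0.0, t_p_prime - t_w   # access = t_c, rewrite = 0, t_p = t_p' - t_w


def event_points(t_a, t_w, t_p):
    """Times of points 1, 2 and 3, measured from the start of the access."""
    point_1 = 0.0          # memory access begins
    point_2 = t_a + t_w    # memory is ready to service a new request
    point_3 = t_a + t_p    # processor finishes processing the word
    return point_1, point_2, point_3


# Illustrative values: access 1.0, rewrite 0.5, processing 2.0 time units.
original = event_points(1.0, 0.5, 2.0)
transformed = event_points(*transform_unit_instruction(1.0, 0.5, 2.0))
```

Both calls place the three points at the same instants, which is the sense in which the transformation leaves the performance unchanged.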

The performance measure used most often in this thesis is the Unit
Instruction Execution Rate (UER), which is the total number of unit
instructions executed by the system in unit time (generally 1 µsec.). We
would like to emphasize, however, that other performance measures, such
as utilization factor, percentage idle time, etc., can all be reduced to
UER, and given one, all the others can be easily calculated, if necessary.
Thus, processor utilization = $t_p \times$ UER, memory utilization =
$t_c \times$ UER, etc.



CHAPTER 3 


MODELS FOR CROSSBAR SWITCH SYSTEMS

In this chapter, we shall be concerned with analytic models for
multiprocessors using a crossbar switch. In Section 3.1, we give a
summary of past work in this area. All these existing models make the
assumption that the memory references of each processor are uniformly
distributed among the memory modules. Although this assumption
considerably simplifies the analysis, it is not very realistic, since
programs generally exhibit the property of locality of references. We
shall propose a model with local referencing, which reflects more closely
the behavior of real systems. Section 3.2 lists in detail the
assumptions made for this model. In Section 3.3, we use discrete Markov chain
techniques to analyse this model. Section 3.4 presents an alternative
analysis using continuous Markov chain processes. Simulation results
are given in Section 3.5. Finally, new expressions for the uniform
reference model are developed in Section 3.6.

3.1 Review of Existing Models

In this section, we consider systems with $p$ processors and $m$ memory
modules. We shall denote by $t_p$ the average processing time of all
processors (which are assumed to be identical); all memory modules are assumed
to have equal constant cycle times $t_c$ with access time $t_a$ and rewrite
time $t_w$.




Skinner and Asher [Ski 69] were the first to use Markov chain
models to analyse multiprocessors. However, their study was limited
to a small number of processors and memory modules, and they found
it difficult to generalize their expressions for larger systems.
Strecker [Str 70], using some simplifying assumptions, was able to
give general approximate expressions. He considered three cases:
(a) $t_p < t_w$, (b) $t_p = t_w$, and (c) $t_p > t_w$. Of these, the third case is
important because many real systems fall in this category. Strecker
gives the following expression for the performance of a multiprocessor
system for case (c) ($t_p > t_w$):

$$\mathrm{UER} = \frac{m}{t_c}\left(1 - \left(1 - \frac{P_m}{m}\right)^{p}\right)$$

where

$$P_m + \frac{m}{p}\,\frac{t_p - t_w}{t_c}\left(1 - \left(1 - \frac{P_m}{m}\right)^{p}\right) - 1 = 0 \qquad (3.1)$$

In Chapter 5 we shall have occasion to use this equation. There
we shall assume $t_w$ to be zero; since $t_p$ is always positive, our system
will fall under case (c) and Equation (3.1) will be applicable to it.
It should, however, be remembered that Strecker's analysis is approximate,
not exact.
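Equation (3.1) defines $P_m$ only implicitly, so a numerical root finder is needed before UER can be evaluated. The following is a minimal sketch using bisection on the interval [0, 1]; the function name, the choice of bisection and the tolerance are assumptions made for illustration and are not taken from Strecker's work or from this thesis.

```python
def strecker_uer(p, m, t_p, t_w, t_c, tol=1e-12):
    """Approximate UER for case (c), t_p >= t_w, from Equation (3.1)."""
    if t_p < t_w:
        raise ValueError("Equation (3.1) is stated for t_p >= t_w")
    ratio = (t_p - t_w) / t_c

    def busy_term(P):
        # 1 - (1 - P/m)^p, the factor shared by both equations
        return 1.0 - (1.0 - P / m) ** p

    def residual(P):
        # Left-hand side of Equation (3.1); it increases with P on [0, 1],
        # with residual(0) = -1 and residual(1) >= 0.
        return P + (m / p) * ratio * busy_term(P) - 1.0

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if residual(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    P_m = 0.5 * (lo + hi)
    return (m / t_c) * busy_term(P_m)
```

Two limiting cases give a quick sanity check: with $t_p = t_w$ the root is $P_m = 1$ and the expression reduces to $(m/t_c)(1 - (1 - 1/m)^p)$, and with a single processor it reduces to $1/(t_c + t_p - t_w)$.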

Bhandarkar [Bha 75] used discrete Markov chain models to analyse
cases (b) and (c) above. He used this analysis to write a program to
calculate exact values for the system performance. However, his program
is time-consuming for even moderately-sized systems. Bhandarkar
modified Strecker's expression for case (b) in the light of the
results available from his program. Bhandarkar and Fuller
analysed multiprocessors using a continuous-time Markov chain
model.

In this chapter we shall consider only those multiprocessor systems
in which the memory is partitioned into modules by the higher order bits
of the address. Memories interleaved by the low order address bits have
been studied by Burnett and Coffman [Bur 70, Bur 73, Bur 75], also
jointly with Snowdon [Cof 71], and by Sastry and Kain [Sas 75]. It
should be noted that if we assume uniformly distributed memory references
by each processor, then the behavior of low-order interleaved memories
is no different from that of high-order interleaved memories. Baskett
and Smith [Bas 76] have given asymptotic results for low-order
interleaved memories with uniformly distributed references. Thus their
physical model is the same as that of Bhandarkar, with the difference that they
have studied its asymptotic behavior.

3.2 Assumptions for Local Reference Model 

The following major assumptions characterize the local reference
model developed in this chapter:

Assumption 1 : The system has $p$ processors and $m$ memory modules. All
processors and all memory modules are identical. This will be referred
to as a $p \times m$ system.

Assumption 2 : Instructions of the processors are modelled using the
concept of the unit instruction defined in Section 2.5.

Assumption 3 : All memory modules have equal constant cycle times and
their operation is synchronized with no overlapping of read/write cycles.
The access time of each module is equal to its cycle time, and the
rewrite time is zero (see Figure 6(b)). The processing time of each
processor is assumed to be zero.

Assumption 4 : The processors and memories are connected by a crossbar
switch which permits every processor to have access to every memory
module. All memory modules are simultaneously accessible so that, under
no conflict, a maximum of min($p$, $m$) words can be fetched simultaneously.
The switch is assumed to have zero delay. However, crossbar switches
with nonzero delay may be modelled by simply adding the delay to $t_c$, the
memory cycle time.

Assumption 5 : From each memory module only one word can be fetched at
a time. If two or more processors simultaneously make requests for the
same memory module, only one of these requests can be served in the next
memory cycle. The other processors are queued up at the module to be
served in subsequent cycles.

Assumption 6 : Consecutive addresses in memory are mapped into the same
module modulo the module size. Thus the high-order bits of an address
determine the module to which the address belongs.

Assumption 7 : Successive requests of a processor follow the pattern
described below. If the $k$-th request of a processor is for memory module
$i$, then its $(k+1)$-st request will be for module $i$ with probability $\alpha$,
and for module $j$ ($j \ne i$) with probability $(1 - \alpha)/(m - 1)$. Thus all
memory modules except module $i$ are accessed with equal probability.
Probability $\alpha$ is a constant and is equal for all processors.

It should be noted that if $\alpha = 1/m$ then all memory modules are
accessed with equal probability, and this model reduces to the uniform
reference model analysed by Bhandarkar [Bha 75]. However, in general,
$\alpha$ will not equal $1/m$, in which case we shall call this the local
reference model.
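The request pattern of Assumption 7 can be written down directly as a probability vector over the modules. The sketch below is illustrative only: the function names are hypothetical, modules are numbered from 0 rather than 1, and the case of a single module is handled separately because the expression $(1 - \alpha)/(m - 1)$ is undefined there.

```python
import random


def reference_probabilities(i, m, alpha):
    """Probabilities of the next request going to modules 0..m-1,
    given that the current request was for module i."""
    if m == 1:
        return [1.0]  # only one module exists, so it is always referenced
    other = (1.0 - alpha) / (m - 1)
    return [alpha if j == i else other for j in range(m)]


def next_module(i, m, alpha, rng=random):
    """Draw the module of the next request according to Assumption 7."""
    weights = reference_probabilities(i, m, alpha)
    return rng.choices(range(m), weights=weights)[0]


# With alpha = 1/m every module is equally likely (uniform reference model).
uniform_row = reference_probabilities(0, 4, 1.0 / 4)
```

Passing a seeded `random.Random` instance as `rng` makes a generated request sequence reproducible.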

If $\alpha$ is large compared to $1/m$, the processor will tend to access
the same memory module repeatedly until it changes to a different module,
and the same behavior is repeated. It is our belief that this model is
more representative of real-life multiprocessor systems than the uniform
reference model. A multiprocessor system generally works in a
multiprogramming environment in which each processor executes a more-or-less
independent task. Thus each processor would concentrate its attention
on blocks of consecutive addresses which, in our model, are mapped into
the same module. Thus the probability of consecutive references being
to the same module is quite high. Occasionally a task may be split into
one or more modules; references may also be made to the executive which
may reside in a different module. But this happens relatively infrequently;
programs are also mostly sequential in nature and present-day programming
styles emphasize modular programs. Hence the parameter $\alpha$, though not
equal to 1, will be quite close to it. It seems reasonable to assume
that most such environments will have $\alpha > 0.75$. We shall show later
that the multiprocessor performance is more or less unaffected by the
value of $\alpha$ so long as $\alpha$ lies in this range. However, the performance of
systems with $\alpha > 0.75$ is worse than that predicted by the uniform
reference model.
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The perforacnce raeasuro used in this chapter is the Average IJunber of 
Busy Meaory Modules (aBH'I)- This is the- avorage nuuber of uenory nodules 
that arc busy during a nenor;/ cycle. Using this nea.sure ?ri.ll pernit us to 
conveniently conparo our results vd.th other results available in the litera- 
ture. Its relation with the Unit Instruction Execution Bate (UER) is given 
by 

UER = Am.v't^ . 

3.3 Discrete Markov Chain Analysis

In this section we shall analyse the local reference model, defined by the assumptions of the previous section, using discrete Markov chain techniques. An excellent description of Markov processes may be found in Kleinrock [Kle 75]. Bhandarkar [Bha 75] has developed a systematic approach to the use of the discrete Markov chain technique for analysing memory interference in multiprocessor systems. We shall use this technique in the analysis of this section, and again in Chapter 4 for analysing the model for the time-shared bus.

The exact analysis of the Markov chain model is very complex, even for the uniform reference model. For this reason, Bhandarkar did not attempt to derive general expressions for the system performance with p and m as parameters. Instead, he wrote a program to compute the Average Number of Busy Memory Modules for any given p x m system.

In this section, we shall derive such expressions for the local reference model with m as a parameter for small, constant values of p (such as 2 or 3), and correspondingly, with p as a parameter for small, constant values of m. Approximations of these expressions will then be generalized to hold for all values of p and m. We shall use the same approach with the uniform reference model in Section 3.6. It should be noted that the local reference model has α as an additional parameter, making the analysis more complicated. Our interest will, however, lie in systems for which α exceeds 0.75.

At any given time, the state of a p x m system can be characterized by the lengths of the queues at each memory module. Following Bhandarkar's notation, this state is denoted by an m-tuple (k_1, k_2, ..., k_m), where

$$\sum_{i=1}^{m} k_i = p,$$

and 0 ≤ k_i ≤ p for 1 ≤ i ≤ m. Integer k_i represents the number of processors waiting in the queue at module i (including the processor being served). Since all processors are identical, a number of these states are equivalent, such as states (2,2,1), (2,1,2) and (1,2,2). Every such equivalence class will be called a reduced state. In the notation of a reduced state, we shall generally omit all 0's, e.g., state (2,1,0,0) will be written simply as (2,1). For any given value of m, this notation is unambiguous.

Let us consider a 2 x m system, in which there are two processors and m ≥ 2 memory modules. This system has only two reduced states, s_1 = (2) and s_2 = (1,1). Consider state s_1 = (2). At the end of a memory cycle, the resultant partial state is (1) with one free processor to be reassigned. This may be assigned to the same memory module with probability α and to a different module with probability (1 - α). Thus transitions from state s_1 to states s_1 and s_2 will occur with probabilities α and (1 - α) respectively. Following a similar procedure for state s_2, the transition matrix T can be shown to be:

$$T = \begin{pmatrix} \alpha & 1 - \alpha \\[6pt] \dfrac{(1 - \alpha)(m\alpha + m - 2)}{(m - 1)^2} & \dfrac{m^2 + m(\alpha^2 - 3) + 3 - 2\alpha}{(m - 1)^2} \end{pmatrix}$$

By computing the steady-state probabilities for this Markov chain, it can be shown that the Average Number of Busy Memory Modules for a 2 x m system is given by

$$\mathrm{ANBM} = \frac{m(2m + \alpha - 3)}{m(m + \alpha - 1) - 1} \tag{3.2}$$
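As a check on the 2 x m analysis, the two-state chain can be solved numerically and compared with Equation (3.2). The following Python sketch is our own illustration (the names `ref_prob`, `anbm_2_by_m` and `anbm_eq_3_2` are hypothetical); it obtains the transition probability from (1,1) to (2) by enumerating the target modules rather than from the closed form.

```python
def ref_prob(current, target, m, alpha):
    """Probability that a processor last served by `current` requests `target`."""
    return alpha if target == current else (1 - alpha) / (m - 1)

def anbm_2_by_m(m, alpha):
    """ANBM of a 2 x m system from its two reduced states (2) and (1,1).

    Assumes m >= 2 and 0 < alpha < 1 so that the chain is irreducible.
    """
    # From state (2): the freed processor leaves its module with prob. 1 - alpha.
    t12 = 1 - alpha
    # From state (1,1): both processors (say at modules 0 and 1) are freed;
    # the chain moves to state (2) when both request the same target module.
    t21 = sum(ref_prob(0, t, m, alpha) * ref_prob(1, t, m, alpha)
              for t in range(m))
    # Two-state balance: pi1 * t12 = pi2 * t21 with pi1 + pi2 = 1.
    pi1 = t21 / (t12 + t21)
    # One module is busy in state (2), two in state (1,1).
    return 1 * pi1 + 2 * (1 - pi1)

def anbm_eq_3_2(m, alpha):
    """Closed form of Equation (3.2)."""
    return m * (2 * m + alpha - 3) / (m * (m + alpha - 1) - 1)
```

For example, `anbm_2_by_m(4, 0.8)` and `anbm_eq_3_2(4, 0.8)` should agree to within rounding error.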

Let us now study a p x 2 system having two memory modules and p ≥ 2 processors. This system has ⌊p/2⌋ + 1 states¹:

(p), (p-1,1), (p-2,2), ..., (⌊(p+1)/2⌋, ⌊p/2⌋).

For example, if p = 8, then the states are (8), (7,1), (6,2), (5,3), and (4,4); if p = 9, then the states are (9), (8,1), (7,2), (6,3), and (5,4).

The transition matrix can now be obtained, and it can be shown that for a p x 2 system:

$$\mathrm{ANBM} = \frac{2(p + \alpha - 1)}{p + 2\alpha - 1} \tag{3.3}$$
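The p x 2 result can be checked in the same spirit. The sketch below is our own illustration, not the method used in the thesis: instead of the reduced states it tracks the number a of processors queued at the first module (so a runs from 0 to p), iterates the state distribution until it settles, and then averages the number of busy modules. The function name and the iteration count are assumptions.

```python
def anbm_p_by_2(p, alpha, iterations=5000):
    """Steady-state ANBM of a p x 2 system under the local reference model.

    Assumes 0 < alpha < 1. State a = number of processors at module 1;
    module 2 holds the remaining p - a processors.
    """
    dist = [0.0] * (p + 1)
    dist[p] = 1.0                      # start with all processors at module 1
    for _ in range(iterations):
        nxt = [0.0] * (p + 1)
        for a, prob in enumerate(dist):
            if prob == 0.0:
                continue
            outcomes = [(a, prob)]     # (new value of a, probability)
            if a > 0:
                # Processor freed by module 1: stays (alpha) or moves to module 2.
                outcomes = ([(c, q * alpha) for c, q in outcomes] +
                            [(c - 1, q * (1 - alpha)) for c, q in outcomes])
            if p - a > 0:
                # Processor freed by module 2: stays (alpha) or moves to module 1.
                outcomes = ([(c, q * alpha) for c, q in outcomes] +
                            [(c + 1, q * (1 - alpha)) for c, q in outcomes])
            for c, q in outcomes:
                nxt[c] += q
        dist = nxt
    # A module is busy whenever its queue is non-empty.
    return sum(prob * ((a > 0) + (p - a > 0)) for a, prob in enumerate(dist))

def anbm_eq_3_3(p, alpha):
    """Closed form of Equation (3.3)."""
    return 2 * (p + alpha - 1) / (p + 2 * alpha - 1)
```

For α very close to 1 the chain mixes slowly and `iterations` would have to be increased.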

If we substitute α = 1 in Equations (3.2) and (3.3), we get respectively:

$$\mathrm{ANBM} = 2 - \frac{2}{m + 1} \tag{3.4}$$

and

$$\mathrm{ANBM} = 2 - \frac{2}{p + 1} \tag{3.5}$$

We shall show in Section 3.5 that when α lies in the range 0.75 to 0.95, Equations (3.2) and (3.3) can be approximated, without any significant loss in accuracy, by Equations (3.4) and (3.5) respectively.

Footnote:
¹ ⌊x⌋ denotes the largest integer smaller than or equal to x.

Now consider a 3 x m system with three processors and m ≥ 3 memory modules. This system becomes exceedingly difficult to analyse for an arbitrary value of probability α using the method employed in the analysis of the 2 x m and p x 2 systems, because of the large number of states involved. However, if we assume α = 1, the problem is more tractable, and it can be shown that for a 3 x m system with α = 1

$$\mathrm{ANBM} = 3 - \frac{6}{m + 2} \tag{3.6}$$

That this is a good approximation to the actual value when α lies in the range 0.75 to 0.95 is borne out by the simulation results discussed in Section 3.5.

A general expression suggested by the three Equations (3.4), (3.5), and (3.6) is that in a p x m system with α = 1 the Average Number of Busy Memory Modules should be given by

$$\mathrm{ANBM} = p - \frac{p(p - 1)}{m + p - 1} = \frac{mp}{m + p - 1} \tag{3.7}$$

It should be noted that this equation is symmetric with respect to m and p. The nature of the actual values of ANBM and the accuracy of these approximations will be explored in Section 3.5.
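That Equation (3.7) contains Equations (3.4), (3.5), and (3.6) as particular cases, and that it is symmetric in m and p, can be confirmed with exact rational arithmetic. The short Python sketch below is only an illustration; the function name `anbm_eq_3_7` is our own.

```python
from fractions import Fraction

def anbm_eq_3_7(p, m):
    """Equation (3.7): approximate ANBM of a p x m system for alpha near 1."""
    return Fraction(m * p, m + p - 1)

# A few sample values, keyed by (p, m).
samples = {(p, m): anbm_eq_3_7(p, m) for p in (2, 3, 4) for m in (2, 4, 8)}
```

For instance, `samples[(2, 4)]` is 8/5, which equals 2 - 2/(m + 1) for m = 4 as required by Equation (3.4).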

3.4 Continuous Markov Chain Analysis

In this section, we shall use a continuous-time Markov chain model to analyse the local reference model. Interestingly, the exact solution for this method is the same as Equation (3.7), which was an approximate solution for the discrete method.

Continuous-time Markov chains were used by Bhandarkar and Fuller [Bha 73] to analyse the uniform reference model. To use this technique, we need to abandon the assumption of constant memory cycle time and assume that memory cycle times are exponentially distributed. Although this assumption is not very realistic, it may be useful to view the resulting expression as a lower bound on the system performance [Bha 73]. Moreover, this method gives us an expression that is valid for all values of α.

The formulae used here were derived by Jackson [Jac 63] and Gordon and Newell [Gor 67]; we shall, however, use the notation of Kleinrock [Kle 76, Section 4.12]. Our model is now viewed as a closed queuing network with m service centers and p permanent customers. Transitions from one center to another are determined by a routing probability matrix R. The element r_ij of this matrix gives the probability of going to center j on completion of service at center i and, in our model, is equal to α when i = j and (1 - α)/(m - 1) when i ≠ j. States are denoted by a vector k = (k_1, k_2, ..., k_m) as in the discrete model. The equilibrium probability p(k_1, k_2, ..., k_m) is given by

$$p(k_1, k_2, \ldots, k_m) = \frac{1}{G(p)} \prod_{i=1}^{m} x_i^{k_i},$$

where

$$G(p) = \sum_{k \in A} \prod_{i=1}^{m} x_i^{k_i},$$

A is that set of vectors k for which

$$\sum_{i=1}^{m} k_i = p,$$

x_1, x_2, ..., x_m are the solutions of

$$\mu_i x_i = \sum_{j=1}^{m} \mu_j x_j r_{ji}, \qquad i = 1, 2, \ldots, m,$$

and μ_i are the mean service rates of the service centers. Since all memory modules are identical, the μ_i are all equal. Substituting for r_ji in the last equation, we get

$$x_i = \alpha x_i + \frac{1 - \alpha}{m - 1} \sum_{j \ne i} x_j$$

or

$$x_i = \frac{1}{m - 1} \sum_{j \ne i} x_j,$$

which gives

$$x_i = \frac{1}{m} \sum_{j=1}^{m} x_j.$$

Thus all x_i's are equal and independent of α. From here on, the analysis is exactly the same as done by Bhandarkar and Fuller [Bha 73]. Solving for the equilibrium probabilities, we find that

$$p(k_1, k_2, \ldots, k_m) = \binom{m + p - 1}{p}^{-1}$$

for all k. Thus, all the states of the system are equally likely. This gives the Average Number of Busy Memories as

$$\mathrm{ANBM} = \frac{mp}{m + p - 1}.$$

This equation is identical to Equation (3.7). It was first derived in [Bha 73] for the uniform reference model, i.e., when α = 1/m, which is a particular case of our derivation.

We thus find that under the assumption of exponentially distributed memory cycle times, the performance of the system is independent of α. Equation (3.7) may be viewed as a lower bound on the performance of real-life systems in which the cycle time is not exponentially distributed. Similarly, the discrete Markov chain model gives an upper bound since the processing time of a real-life system is never a constant and is better approximated by the exponential distribution [Bha 75].
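The step from "all states equally likely" to the closed form mp/(m + p - 1) can be confirmed by brute force for small systems. The following Python sketch is our own illustration (the function name is hypothetical): it enumerates every vector k with non-negative components summing to p, gives each the same weight, and averages the number of non-empty queues.

```python
from fractions import Fraction
from itertools import product

def anbm_equally_likely(p, m):
    """Average number of busy modules when all states k are equally likely."""
    # All m-tuples of queue lengths whose components sum to p.
    states = [k for k in product(range(p + 1), repeat=m) if sum(k) == p]
    # A module is busy in a state when its queue length is positive.
    busy = sum(sum(1 for k_i in k if k_i > 0) for k in states)
    return Fraction(busy, len(states))

example = anbm_equally_likely(3, 4)   # 3 processors, 4 memory modules
```

The enumeration grows as (p + 1)^m, so this is only practical for the small values of p and m used here.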

3.5 Simulation Results

Simulation studies were conducted to validate the local reference model and to provide the basis for comparing the expressions derived in the earlier sections of this chapter. The programs were written in FORTRAN IV and run on an IBM 7044. To find the steady-state system performance, the number of busy memory modules in a cycle was averaged over a total of 5000 memory cycles. This amounted to the processing of between 7000 and 33000 instructions (approximately) by the multiprocessor system, depending on the number of processors and memory modules.
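The original FORTRAN IV programs are not reproduced here. As a rough indication of what such a simulation involves, the following Python sketch implements the discrete-time local reference model under our own assumptions: constant memory cycle time, zero processing time, a random initial placement of the processors, and hypothetical names (`simulate_anbm`, `seed`). It should not be read as the program actually used.

```python
import random

def simulate_anbm(p, m, alpha, cycles=5000, seed=1):
    """Estimate ANBM of a p x m system (m >= 2) by simulating memory cycles."""
    rng = random.Random(seed)
    queue = [0] * m                      # queue[i] = processors waiting at module i
    for _ in range(p):
        queue[rng.randrange(m)] += 1     # arbitrary initial placement
    busy_total = 0
    for _ in range(cycles):
        busy = [i for i in range(m) if queue[i] > 0]
        busy_total += len(busy)
        # Every busy module completes one request in this cycle.
        for i in busy:
            queue[i] -= 1
        # Each freed processor immediately issues its next request.
        for i in busy:
            if rng.random() < alpha:
                target = i               # same module again
            else:
                target = rng.randrange(m - 1)
                if target >= i:          # skip module i: uniform over the others
                    target += 1
            queue[target] += 1
    return busy_total / cycles

estimate = simulate_anbm(2, 4, 0.5, cycles=200000)
```

For the 2 x 4 system with α = 0.5 the estimate should lie close to the value 22/13 ≈ 1.69 given by Equation (3.2).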

Figure 7 shows the Average Number of Busy Memory Modules plotted as a function of α for various values of p and m. This figure clearly demonstrates that for a given multiprocessor system, ANBM falls as α increases from 0 to 1. However, over the range α > 0.75, the variation in ANBM is very small. Thus the system performance may be accurately represented by the average value of the ANBM figures in this range.

As mentioned in Section 3.2, a multiprocessor system is generally expected to have α > 0.75. The averages of the values of ANBM corresponding to α = 0.75, 0.8, 0.85, 0.9, and 0.95 were computed and these are shown in Table I for various p x m systems. These averages are compared with values obtained from Equation (3.7). In no case does the error exceed 4 percent. Thus, Equation (3.7) provides a very good approximation for the performance of real-life systems. The same is true of Equations (3.4), (3.5), and (3.6), since they are particular cases of (3.7).

A comparison of the performances predicted by the uniform reference and local reference models is shown in Figure 8. The values used for the discrete Markov chain uniform reference model are taken from [Bha 75], while those for the local reference model are computed from Equation (3.7). As we saw in the previous section, Equation (3.7) forms a lower bound on the performance for all values of α. It also forms an approximate upper bound for systems with α > 0.75. Thus, for these systems, this equation is a very good estimate of the performance. The upper bound for the uniform reference model is higher than Equation (3.7); therefore the performance predicted by this model is generally more optimistic than it would be for real-life systems (with α > 0.75). Simulation results for the case α = 0 are also shown in Figure 8. It is evident from the figure that the performance of multiprocessor systems would be improved if programs have addressing patterns that would make α close to 0.

3.6 Uniform Reference Model

In this section, we shall derive expressions for the uniform reference model using the method followed in Section 3.3. Since the local reference model becomes equivalent to the uniform reference model upon substituting α = 1/m, we get straightaway from Equations (3.2) and (3.3) that for uniform reference models

$$\mathrm{ANBM} = 2 - \frac{1}{m} \tag{3.8}$$

and

$$\mathrm{ANBM} = 2 - \frac{1}{p} \tag{3.9}$$

for the 2 x m and p x 2 systems respectively.

A similar discrete Markov chain analysis may be done for the 3 x m system (m ≥ 3) to give

$$\mathrm{ANBM} = 3 - \frac{3}{m} + \frac{1}{m^3 - m^2 + m} \tag{3.10}$$

in the case of uniform reference models. Note that all three expressions (3.8), (3.9), and (3.10) are exact. In (3.10), the last term becomes small when m increases, so we may write as an approximation:

$$\mathrm{ANBM} = 3 - \frac{3}{m} \tag{3.11}$$

Following a similar process, for a 4 x m system (m ≥ 4) we find that¹

$$\mathrm{ANBM} = 4 - \frac{6}{m} + \frac{7m^3 - 12m^2 + 14m - 12}{m\,(m^5 - 3m^4 + 8m^3 - 11m^2 + 8m - 4)} \tag{3.12}$$

When m is large, Equation (3.12) may be approximated by

$$\mathrm{ANBM} = 4 - \frac{6}{m} \tag{3.13}$$

Footnote:
¹ We owe the correct form of this equation to Dr. D.P. Bhandarkar.

A general expression, suggested by Equations (3.8), (3.11), and (3.13), is that in a p x m system (m > p) the Average Number of Busy Memory Modules (for a uniform reference model) should approximately be given by

$$\mathrm{ANBM} = p - \frac{p(p - 1)}{2m} \tag{3.14}$$

However, since we know from [Bha 75] that the performances of an A x B system and a B x A system are almost equal, we may write

$$\mathrm{ANBM} = j - \frac{j(j - 1)}{2i} \tag{3.15}$$

where i = max(m,p) and j = min(m,p).
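The exact expressions (3.10) and (3.12) can be checked numerically by building the reduced-state Markov chain of the uniform reference model directly. The Python sketch below is our own illustration of that procedure, not Bhandarkar's program; the function name and the number of iterations are assumptions. From each reduced state it enumerates every way the freed processors can choose their next modules, accumulates the transition probabilities, and iterates the state distribution to equilibrium.

```python
from itertools import product

def exact_anbm_uniform(p, m, iterations=2000):
    """ANBM of a p x m system under the uniform reference model."""
    def reduce(counts):
        # Reduced state: queue lengths in non-increasing order.
        return tuple(sorted(counts, reverse=True))

    start = reduce([p] + [0] * (m - 1))
    transitions = {}
    pending = [start]
    while pending:
        state = pending.pop()
        if state in transitions:
            continue
        busy = sum(1 for k in state if k > 0)
        # Each busy module serves one processor during the cycle.
        residual = [k - 1 if k > 0 else 0 for k in state]
        weight = 1.0 / m ** busy
        row = {}
        # Every freed processor picks its next module uniformly at random.
        for choice in product(range(m), repeat=busy):
            counts = list(residual)
            for target in choice:
                counts[target] += 1
            nxt = reduce(counts)
            row[nxt] = row.get(nxt, 0.0) + weight
        transitions[state] = row
        pending.extend(row)

    # Power iteration for the equilibrium distribution.
    dist = {s: 1.0 / len(transitions) for s in transitions}
    for _ in range(iterations):
        new = dict.fromkeys(transitions, 0.0)
        for s, prob in dist.items():
            for nxt, t in transitions[s].items():
                new[nxt] += prob * t
        dist = new
    return sum(prob * sum(1 for k in s if k > 0) for s, prob in dist.items())

def anbm_eq_3_10(m):
    return 3 - 3 / m + 1 / (m**3 - m**2 + m)

def anbm_eq_3_12(m):
    return 4 - 6 / m + (7 * m**3 - 12 * m**2 + 14 * m - 12) / (
        m * (m**5 - 3 * m**4 + 8 * m**3 - 11 * m**2 + 8 * m - 4))
```

The enumeration costs m^p steps per state, so the sketch is intended only for small systems such as 3 x m and 4 x m.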

Let us now compare expression (3.15) with two others available in the literature for the uniform reference model. First, Strecker's expression [Str 70], as modified by Bhandarkar [Bha 75], is:

$$\mathrm{ANBM} = i\left[1 - \left(1 - \frac{1}{i}\right)^{j}\right] \tag{3.16}$$

where i = max(m,p) and j = min(m,p). Second, an asymptotic expression given by Baskett and Smith [Bas 76] is:

$$\mathrm{ANBM} = m + p - \left(m^2 + p^2\right)^{1/2} \tag{3.17}$$

In order to compare expressions (3.15), (3.16), and (3.17), we have used the exact numerical results given by Bhandarkar [Bha 75], supplemented beyond the range of Bhandarkar's results by results obtained in the simulation study described in Section 3.5 (with 1/m substituted for α). Table II(a) gives the values of ANBM for p x m systems with 2 ≤ p ≤ 10 and 2 ≤ m ≤ 12. For p ≤ 8 and m ≤ 8 we have used the exact values of Bhandarkar. The rest of the entries were obtained by simulation. The values obtained from Equations (3.16), (3.17), and (3.15) and their comparison with the exact results are shown in Table II.

It can be seen that Baskett and Smith's expression (3.17) is highly inaccurate for small values of m and p. Its accuracy improves as m and p increase. Both Bhandarkar's expression (3.16) and our expression (3.15) improve in accuracy as j increases for a constant i. Equation (3.15) is by far the best of the three for all values of m and p, except when m and p are large and nearly equal. In this range, and only in this range, is Equation (3.17) better.
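The three approximations are easy to evaluate side by side. The Python sketch below is an illustration only: the function names are ours, and the reference value is taken from the exact 4 x m expression (3.12) rather than from Table II, so it covers only the case p = 4.

```python
import math

def approx_3_15(p, m):
    j, i = sorted((p, m))              # j = min(m, p), i = max(m, p)
    return j - j * (j - 1) / (2 * i)

def approx_3_16(p, m):
    j, i = sorted((p, m))
    return i * (1 - (1 - 1 / i) ** j)

def approx_3_17(p, m):
    return m + p - math.sqrt(m * m + p * p)

def exact_4_by_m(m):
    """Equation (3.12), exact for a 4 x m system with uniform references."""
    return 4 - 6 / m + (7 * m**3 - 12 * m**2 + 14 * m - 12) / (
        m * (m**5 - 3 * m**4 + 8 * m**3 - 11 * m**2 + 8 * m - 4))

def errors(m, p=4):
    """Absolute errors of (3.15), (3.16), (3.17) for a 4 x m system."""
    exact = exact_4_by_m(m)
    return tuple(abs(f(p, m) - exact)
                 for f in (approx_3_15, approx_3_16, approx_3_17))

errors_4_by_8 = errors(8)
```

For the 4 x 8 system the errors come out in increasing order for (3.15), (3.16), and (3.17), in line with the ranking described above for systems in which m and p are not nearly equal.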

3.7 Conclusions

The uniform reference model has been extensively studied in the literature because of its simplicity. However, it does not provide a good approximation to the performance of real-life systems in which programs have strong locality of reference. The local reference model proposed in this chapter explicitly models this property, which characterizes a majority of real-life computer programs. Our results show that the performance of such programs in a multiprocessor system is significantly worse than what is predicted by the uniform reference model used earlier. It would thus be worthwhile to make serious efforts in designing programs with uniformly random addressing patterns. Efforts have been made (such as [Per 76]) in the context of multiprogramming environments to make programs more local in behavior. The opposite of this is needed for multiprocessors. Research in the design of programs with uniform addressing patterns could give valuable results and would help in improving substantially the performance of a multiprocessor system.



CHAPTER 4


MODEL FOR TIME-SHARED BUS SYSTEMS

In this chapter, we present a discrete Markov chain model for a multiprocessor with a time-shared bus as shown in Figure 3. After describing our notations in Section 4.1, we discuss the major assumptions of the model in Section 4.2. The analysis of the model is outlined in Section 4.3. An example is given in Section 4.4 to explain the analysis technique in detail. Section 4.5 presents simulation results for the example of Section 4.4 to show the accuracy of the analytical results.

4.1 Notations

We shall consider a multiprocessor system with p processors and m memory modules. The memory modules are of two types as explained in the assumptions to follow in the next section; there are m_1 memories of type 1 and m_2 (= m - m_1) memories of type 2. The processing time of a processor will be denoted by t_p, the cycle times of the two types of memories by t_c1 and t_c2, and the bus cycle time by t_b. The symbols α and β (= 1 - α) stand for the parameters of the processing time distribution while γ_1, δ_1 (= 1 - γ_1), γ_2, and δ_2 (= 1 - γ_2) are parameters of the memory cycle time distributions. The average processing time and the average memory cycle times are t̄_p, t̄_c1, and t̄_c2 respectively. The unit instruction execution rates of instructions executed from the two types of memories are denoted by UER1 and UER2, the utilization factors of these memories by ρ_1 and ρ_2, and the total unit instruction execution rate of the system by UER. The states of the system will be represented by vectors whose components are denoted by k_0, k_1, k_2, k_11, k_12, ..., k_1m_1, k_21, k_22, ..., k_2m_2. The element T(i,j) of the matrix T stands for the transition probability from state s_i to state s_j. The equilibrium probability of state s_i is denoted by z(i).

4.2 Assumptions of the Model

The following major assumptions characterize the model developed in this chapter:

Assumption 1: The p processors and m memory modules of the system are connected by a time-shared bus with constant cycle time t_b. In one cycle time, the bus transfers information from a processor to a memory or vice versa.

Assumption 2: The processing times of all processors are identically geometrically distributed with the distribution

$$\Pr[t_p = i\,t_b] = \beta \alpha^{i}, \qquad i = 0, 1, 2, \ldots,$$

where β = 1 - α. The mean processing time is given by t̄_p = α·t_b/β. Thus any mean value of t_p can be considered by appropriately choosing α. That this is a realistic assumption has been shown by the models of Bhandarkar [Bha 75], who also argues that the processing time of a real-life processor has a distribution quite close to exponential.

Assumption 3: The memory modules are of two types: there are m_1 memories of type 1 and m_2 (= m - m_1) memories of type 2. The access time of all memories is equal to their cycle time and the rewrite time is zero (see Figure 6(b)). The cycle times of all memories of type j (j = 1,2) are identically geometrically distributed with the distribution

$$\Pr[t_{cj} = i\,t_b] = \delta_j \gamma_j^{\,i-1}, \qquad i = 1, 2, \ldots,$$

where δ_j = 1 - γ_j. The mean cycle time of a memory of type j is given by t̄_cj = t_b/δ_j. The difference between these distributions and the processing time distribution should be noted. The processing time takes on values 0, t_b, 2t_b, ..., whereas memory cycle times are not permitted to take value 0. As before, any mean value of t_cj can be chosen by appropriately choosing γ_j.

Assumption 3 makes our model more general than one which has all memory modules identical. The reason for choosing two different types of memories is that such a model will be needed for analysing the Pluribus system in the next chapter, where local memories and global memories may have different cycle times. Note that we can always choose t_c1 = t_c2 (or m_2 = 0) as a particular case.

The memory cycle time has been assumed to have a geometric distribution in order to keep the analysis of the model tractable. We do not want to claim that this is a realistic assumption. Memory modules generally do have a constant cycle time. However, the model is robust enough for this assumption to make no significant difference. This will be borne out by the simulation results presented in Section 4.5 to validate this model.

Assumption 4: All processor requests to memory are read operations. The processor transmits the read address over the bus to the memory module. A memory module can fetch only one word in one cycle. Once the data has been fetched, it is transmitted back over the bus to the requesting processor. However, the memory module is not held up while the data is being transmitted and is free to take up a new request as soon as the previous cycle is over.

Assumption 5: The requests of any one processor are uniformly distributed over all memory modules, regardless of their type.

Assumption 6: Each memory module has a queue in which processor requests are queued. After a request has been served, the next request is taken from the queue if it is not empty.

Assumption 7: The time-shared bus has two queues, one each for the traffic in the two directions. The traffic from memories to processors (in queue 1) has priority over that from processors to memories (in queue 2).

The structure of this queueing model is shown in Figure 9.
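Assumptions 2 and 3 state that any desired mean processing time or mean memory cycle time can be obtained by choosing α or γ_j appropriately. The Python sketch below makes that choice explicit by inverting the two mean formulas, and then confirms the means by summing the distributions numerically. The function names are our own, the geometric forms are those stated in Assumptions 2 and 3, and the sample values (t_b = 150 nsec, mean t_p = 120 nsec, mean t_c = 600 nsec) are taken as an example only.

```python
def alpha_for_mean_processing_time(mean_tp, tb):
    """alpha such that alpha * tb / (1 - alpha) equals mean_tp (Assumption 2)."""
    return mean_tp / (mean_tp + tb)

def gamma_for_mean_cycle_time(mean_tc, tb):
    """gamma_j such that tb / (1 - gamma_j) equals mean_tc (Assumption 3)."""
    if mean_tc < tb:
        raise ValueError("a memory cycle lasts at least one bus cycle")
    return 1 - tb / mean_tc

def mean_of_distribution(pmf, tb, first, terms=2000):
    """Numerical mean of a time that equals i * tb with probability pmf(i)."""
    return sum(i * tb * pmf(i) for i in range(first, first + terms))

tb = 150.0
alpha = alpha_for_mean_processing_time(120.0, tb)
gamma = gamma_for_mean_cycle_time(600.0, tb)

# Processing time: values 0, tb, 2tb, ... with probability beta * alpha**i.
mean_tp = mean_of_distribution(lambda i: (1 - alpha) * alpha ** i, tb, first=0)
# Memory cycle time: values tb, 2tb, ... with probability delta * gamma**(i - 1).
mean_tc = mean_of_distribution(lambda i: (1 - gamma) * gamma ** (i - 1), tb, first=1)
```

Note that the memory cycle time cannot have a mean smaller than t_b, since its smallest permitted value is one bus cycle; the processing time has no such restriction.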

4.3 Analysis of the Model

We shall now model the process defined by the assumptions of the previous section as a discrete Markov chain¹. At any given time, the state of the system can be characterized by the lengths of the queues at the time-shared bus and the memory modules. This state is denoted by the vector

(k_0, k_1, k_2; k_11, k_12, ..., k_1m_1; k_21, k_22, ..., k_2m_2),

where k_0 = number of processors in processing state; k_1 = length of queue 1 at the bus; k_2 = length of queue 2 at the bus; k_ji = number of processors waiting in the queue of the i-th memory module of type j (including the processor being served), i = 1,2, ..., m_j, and j = 1,2. Note that k_1 or k_2 also includes the item being served by the bus, depending on the queue to which it belongs. The component k_0 of the vector is actually redundant because

$$k_0 + k_1 + k_2 + \sum_{j=1}^{2} \sum_{i=1}^{m_j} k_{ji} = p;$$

it is included for the sake of convenience.

Footnote:
¹ For a description of Markov chain processes, see Kleinrock [Kle 75] and Bhandarkar [Bha 75].

FIGURE 9: Structure of the queuing model for time-shared bus systems.

Since all processors are identical, and all memory modules of a particular type are also identical, a number of these states are equivalent, e.g., states (1,0,0; 1,2,0; 1,0), (1,0,0; 2,0,1; 1,0), (1,0,0; 1,2,0; 0,1), (1,0,0; 2,0,1; 0,1), etc. Every such equivalence class will be called a reduced state.

State changes occur only at the end of bus cycles. Thus if we consider transitions between states at points just prior to the end of bus cycles, we get a discrete-parameter Markov chain. The procedure for analysing this model is as follows:

Given the values of p, α, m_1, m_2, γ_1, and γ_2, the reduced states must be enumerated and the transition probabilities T(i,j) between every pair of states (s_i, s_j) must be found. The method of enumerating the reduced states and finding the transition probabilities is similar to that employed in the crossbar switch models of the previous chapter; a systematic technique for it has also been given by Bhandarkar [Bha 75]. The equilibrium state probabilities z(i) can now be evaluated by solving the set of equations

$$z(i) = \sum_{j} z(j)\,T(j,i) \tag{4.1}$$

together with

$$\sum_{i} z(i) = 1 \tag{4.2}$$

The unit instruction execution rate (UER) or any other performance measure one is interested in can be calculated from a knowledge of the equilibrium state probabilities.

This outline of the analysis method will be explained in detail by the example of the next section.
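Solving Equations (4.1) and (4.2) is a standard linear-algebra task. As an illustration only, the Python sketch below solves them exactly for an arbitrary transition matrix by Gauss-Jordan elimination over rational numbers; the function name `equilibrium` and the small 2-state matrix used as an example are our own and do not come from the model of this chapter.

```python
from fractions import Fraction

def equilibrium(T):
    """Solve z(i) = sum_j z(j) T[j][i] together with sum_i z(i) = 1.

    T is a square list of lists of transition probabilities for an
    irreducible chain. One balance equation is redundant, so the last
    one is replaced by the normalization condition (4.2).
    """
    n = len(T)
    rows = []
    for i in range(n - 1):
        # Row i encodes sum_j z(j) T[j][i] - z(i) = 0.
        row = [Fraction(T[j][i]) - (1 if i == j else 0) for j in range(n)]
        rows.append(row + [Fraction(0)])
    rows.append([Fraction(1)] * n + [Fraction(1)])

    # Gauss-Jordan elimination with a search for a non-zero pivot.
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [x / rows[col][col] for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]

# Hypothetical 2-state example, not taken from the thesis.
T_example = [[Fraction(1, 2), Fraction(1, 2)],
             [Fraction(1, 4), Fraction(3, 4)]]
z_example = equilibrium(T_example)
```

For the example matrix the solution is z = (1/3, 2/3), which can be verified by substitution into (4.1).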

4.4 An Example

Consider a system with 2 processors and 3 memory modules, 2 of which belong to type 1 and one is of type 2. Thus p = 2, m_1 = 2, and m_2 = 1. For this case, there are 16 reduced states:

s_1 = (2,0,0; 0,0; 0),     s_2 = (1,1,0; 0,0; 0),
s_3 = (1,0,1; 0,0; 0),     s_4 = (1,0,0; 1,0; 0),
s_5 = (1,0,0; 0,0; 1),     s_6 = (0,2,0; 0,0; 0),
s_7 = (0,1,1; 0,0; 0),     s_8 = (0,1,0; 1,0; 0),
s_9 = (0,1,0; 0,0; 1),     s_10 = (0,0,2; 0,0; 0),
s_11 = (0,0,1; 1,0; 0),    s_12 = (0,0,1; 0,0; 1),
s_13 = (0,0,0; 2,0; 0),    s_14 = (0,0,0; 1,1; 0),
s_15 = (0,0,0; 1,0; 1),    s_16 = (0,0,0; 0,0; 2).

The transition matrix for these states is given in the Appendix in terms of parameters α, γ_1, γ_2, β = 1 - α, δ_1 = 1 - γ_1, δ_2 = 1 - γ_2.

A computer program can now be written which, given values for these parameters, will solve the set of simultaneous linear equations (4.1), (4.2), and compute the equilibrium state probabilities z(i). The unit instruction execution rate (UER) can now be found in the following way.

During any given period of time, every instruction executed by the system keeps the bus busy for two cycles: once for transferring the address to the memory, and again for transmitting the data back to the processor. Hence the sum of the equilibrium probabilities of those states during which the bus is serving an item from queue 1 will give the average number of instructions executed in one bus cycle of duration t_b. These states are s_2, s_6, s_7, s_8, and s_9. Alternatively, we may find the sum of the equilibrium probabilities of those states during which the bus is serving an item from queue 2. Since queue 1 has priority over queue 2, these states are s_3, s_10, s_11, and s_12. Thus,

$$\mathrm{UER} = \frac{z(2) + z(6) + z(7) + z(8) + z(9)}{t_b} = \frac{z(3) + z(10) + z(11) + z(12)}{t_b}.$$

We are also interested in finding UER1 and UER2, the average rates of instructions executed from memories of type 1 and from memories of type 2 respectively. Clearly, UER = UER1 + UER2.
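The list of reduced states and the sets of states used in the UER expressions can be reproduced mechanically. The Python sketch below is our own illustration (the function names and the tuple layout are assumptions): it enumerates the reduced states for given p, m_1 and m_2, and then picks out the states in which the bus serves queue 1 (k_1 > 0) or queue 2 (k_1 = 0 and k_2 > 0, since queue 1 has priority).

```python
def partitions(total, slots, largest=None):
    """Non-increasing tuples of `slots` non-negative integers summing to `total`."""
    if largest is None:
        largest = total
    if slots == 0:
        return [()] if total == 0 else []
    result = []
    for first in range(min(total, largest), -1, -1):
        for rest in partitions(total - first, slots - 1, first):
            result.append((first,) + rest)
    return result

def reduced_states(p, m1, m2):
    """Reduced states as ((k0, k1, k2), type-1 queues, type-2 queues)."""
    states = []
    for k0 in range(p, -1, -1):
        for k1 in range(p - k0, -1, -1):
            for k2 in range(p - k0 - k1, -1, -1):
                left = p - k0 - k1 - k2          # processors at the memories
                for n1 in range(left, -1, -1):   # of which n1 at type-1 modules
                    for type1 in partitions(n1, m1):
                        for type2 in partitions(left - n1, m2):
                            states.append(((k0, k1, k2), type1, type2))
    return states

states = reduced_states(2, 2, 1)
# State numbers (starting from 1) in which the bus serves queue 1 or queue 2.
serving_queue_1 = [n for n, (bus, _, _) in enumerate(states, start=1) if bus[1] > 0]
serving_queue_2 = [n for n, (bus, _, _) in enumerate(states, start=1)
                   if bus[1] == 0 and bus[2] > 0]
# Number of busy type-1 memories in each state (the weights used for rho_1).
busy_type_1 = {n: sum(1 for k in t1 if k > 0)
               for n, (_, t1, _) in enumerate(states, start=1)}
```

For p = 2, m_1 = 2, m_2 = 1 the enumeration order used in this sketch happens to coincide with the numbering s_1, ..., s_16 of the example; this should not be assumed for other parameter values.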

The sum of the equilibrium probabilities of all states weighted by the number of memories of type j busy during that state gives the utilization factor ρ_j of those memories. Thus

$$\rho_1 = z(4) + z(8) + z(11) + z(13) + 2\,z(14) + z(15)$$

and

$$\rho_2 = z(5) + z(9) + z(12) + z(15) + z(16).$$

Since the average cycle time for memories of type j is t̄_cj = t_b/δ_j, and one instruction is fetched by one memory module in one cycle, the average rate of instructions executed from memories of type j is

$$\mathrm{UER}j = \frac{\rho_j}{\bar{t}_{cj}} = \frac{\rho_j\,\delta_j}{t_b}.$$

TABLE III

RESULTS FOR TIME-SHARED BUS MODEL

p = 2, m_1 = 2, m_2 = 1, t_c1 = 600 nsec.

   t_p      t_b     t_c2     Analytic results (insts/μsec)    Simulation UER   Percentage
 (nsec)   (nsec)   (nsec)       UER1       UER2        UER     (insts/μsec)        error

   120      150      300     1.2514     0.6257     1.8772         1.928          2.71
   120      150      600     1.1309     0.5655     1.6964         1.785          5.22
   120      150      900     1.0193     0.5097     1.5290         1.593          4.19
   120      300      300     0.9122     0.4561     1.3683         1.394          1.88
   120      300      600     0.8514     0.4257     1.2772         1.337          4.68
   120      300      900     0.7926     0.3963     1.1889         1.232          3.63
   240      150      300     1.1455     0.5728     1.7183         1.758          2.31
   240      150      600     1.0433     0.5217     1.5650         1.624          3.77
   240      150      900     0.9484     0.4742     1.4226         1.473          3.54
   240      300      300     0.8630     0.4315     1.2945         1.304          0.73
   240      300      600     0.8070     0.4035     1.2105         1.248          3.10
   240      300      900     0.7535     0.3767     1.1302         1.165          3.08
   360      150      300     1.0508     0.5254     1.5763         1.621          2.84
   360      150      600     0.9643     0.4822     1.4465         1.482          2.45
   360      150      900     0.8834     0.4417     1.3251         1.382          4.29
   360      300      300     0.8130     0.4065     1.2195         1.235          1.27
   360      300      600     0.7626     0.3813     1.1439         1.176          2.81
   360      300      900     0.7146     0.3573     1.0719         1.095          2.16
   480      150      300     0.9681     0.4841     1.4522         1.479          1.85
   480      150      600     0.8944     0.4472     1.3416         1.386          3.31
   480      150      900     0.8251     0.4126     1.2377         1.282          3.53
   480      300      300     0.7657     0.3829     1.1486         1.150          0.12
   480      300      600     0.7207     0.3603     1.0810         1.097          1.40
   480      300      900     0.6777     0.3388     1.0165         1.032          1.52
   600      150      300     0.8960     0.4480     1.3441         1.385          3.04
   600      150      600     0.8328     0.4164     1.2492         1.279          2.39
   600      150      900     0.7730     0.3865     1.1595         1.175          1.34
   600      300      300     0.7221     0.3610     1.0831         1.083          0.01
   600      300      600     0.6818     0.3409     1.0227         1.045          2.18
   600      300      900     0.6433     0.3217     0.9650         0.976          1.14
  1200      150      300     0.6472     0.3236     0.9709         0.955          1.64
  1200      150      600     0.6143     0.3071     0.9214         0.931          1.04
  1200      150      900     0.5824     0.2912     0.8736         0.885          1.30
  1200      300      300     0.5552     0.2776     0.8328         0.828          0.58
  1200      300      600     0.5312     0.2656     0.7968         0.818          2.66
  1200      300      900     0.5080     0.2540     0.7620         0.766          0.52
  1800      150      300     0.5040     0.2520     0.7560         0.751          0.66
  1800      150      600     0.4841     0.2420     0.7261         0.727          0.12
  1800      150      900     0.4646     0.2323     0.6970         0.682          2.15
  1800      300      300     0.4477     0.2239     0.6716         0.685          2.00
  1800      300      600     0.4321     0.2161     0.6482         0.640          1.27
  1800      300      900     0.4169     0.2085     0.6254     [illegible]    [illegible]
  2400      150      300     0.4119     0.2060     0.6179         0.608          1.60
  2400      150      600     0.3987     0.1994     0.5981         0.595          0.52
  2400      150      900     0.3857     0.1928     0.5785         0.575          0.61
  2400      300      300     0.3742     0.1871     0.5612         0.542          3.42
  2400      300      600     0.3633     0.1816     0.5449         0.548          0.57
  2400      300      900     0.3526     0.1763     0.5290         0.536          1.32
  3000      150      300     0.3481     0.1740     0.5221         0.527          0.94
  3000      150      600     0.3386     0.1693     0.5080         0.505          0.59
  3000      150      900     0.3294     0.1647     0.4940         0.490          0.81
  3000      300      300     0.3210     0.1605     0.4815         0.483          0.31
  3000      300      600     0.3130     0.1565     0.4695         0.475          1.17
  3000      300      900     0.3052     0.1526     0.4578         0.447          2.36

Table III shows the values of UER, UER1 and UER2 computed from this model for t_c1 = 600 nsec, and various values of t_p, t_c2, and t_b. A close observation of this table shows that

$$\mathrm{UER}j = \mathrm{UER} \cdot \frac{m_j}{m_1 + m_2}.$$

This actually follows from Assumption 5, namely that the processors' requests are uniformly distributed over all memory modules. The difference in speed of the two types of memories will affect the total instruction execution rate, but each memory module will continue to receive its share of processor requests.

4.5 Simulation Results

Simulation studies were conducted to validate this model for time-shared bus systems. The simulation program assumed constant memory cycle times so that we could estimate the effect of Assumption 3 (geometrically distributed memory cycle times). The program was written in FORTRAN IV and run on an IBM 7044. To find the steady-state system performance, the instruction execution rate was averaged over a total of 5000 memory cycles. The results are shown in Table III along with the analytical results. It can be seen that in no case does the error exceed 6 percent. Thus we can say that the assumption of geometrically distributed memory cycle times is acceptable. The closeness of the analytical and simulation results demonstrates the usefulness of the model.



CHAPTER 5


MODEL FOR PLURIBUS SYSTEMS

In this chapter, we develop a model to analyse the performance of the Pluribus multiprocessor system which was described in Chapter 2. The notations used in this chapter are described in Section 5.1. The major assumptions of the model are outlined in Section 5.2. Section 5.3 gives the detailed analysis of the model. The analytic results are presented in Section 5.4 and are compared with the simulation results in Section 5.5.


5.1 Notations

For the Pluribus model in this chapter, we shall consider the system configuration shown in Figure 10. We assume the following parameters for this model:

n_p  = number of processors on each processor bus,
n_ml = number of local memory modules on each processor bus,
n_mg = number of global memory modules,
n_pb = number of processor buses,
t_p  = processing time of each processor,
t_cl = cycle time of each local memory module,
t_cg = cycle time of each global memory module,
and t_b = bus cycle time.

It should be noted that the system modelled here (Figure 10) differs
in one respect from the actual configuration of the Pluribus system shown
in Figure 5. The crossbar switch, instead of connecting to memory buses,
here connects directly to global memory modules.

FIGURE 10: Pluribus configuration for the model

These two systems may be considered equivalent through the following
transformation: let the system in Figure 5 have $n_{mb}$ memory buses with bus
cycle time $t_{mb}$, each having $n_{bg}$ memory modules, and let the cycle time of
these memory modules be $t_{cmg}$. This system is then equivalent to the one
in Figure 10 where the number of global memory modules is

$$n_{mg} = n_{mb} \cdot n_{bg}$$

with memory cycle time

$$t_{cg} = t_{cmg} + 2t_{mb}.$$

The term $2t_{mb}$ in the cycle time reflects our assumption that two bus
cycles are necessary to access one word from the memories: one for
transferring the address to the module and the other for transmitting the
data back to the processor.

This transformation is approximate, however. It neglects the queueing
delays due to conflicts for the memory buses. If the number of memories
per bus is not large and the buses are fast enough, this approximation is
justified.
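
The transformation described above amounts to two lines of arithmetic. A minimal sketch in Python follows. The function name `equivalent_global_memory` is hypothetical, and the sample module cycle time of 850 nsec is an assumption, chosen so that the result agrees with the global memory cycle time of 1250 nsec quoted for the BBN system in Section 5.4.

```python
def equivalent_global_memory(n_mb, n_bg, t_cmg, t_mb):
    """Map a memory-bus configuration (Figure 5) onto the model of Figure 10.

    n_mb  : number of memory buses
    n_bg  : memory modules on each memory bus
    t_cmg : cycle time of each of those modules (nsec)
    t_mb  : memory bus cycle time (nsec)
    """
    n_mg = n_mb * n_bg        # every module on every memory bus becomes a global module
    t_cg = t_cmg + 2 * t_mb   # one bus cycle for the address, one for the returned data
    return n_mg, t_cg


# Illustrative values in nsec; t_cmg = 850 is an assumed figure.
n_mg, t_cg = equivalent_global_memory(n_mb=2, n_bg=2, t_cmg=850, t_mb=200)
print(n_mg, t_cg)
```

Like the transformation itself, the sketch ignores queueing delays on the memory buses.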

5.2 Assumptions of the Model

The following major assumptions will be made for this model:

Assumption 1: The crossbar switch has zero delay. However, crossbar switches
with nonzero delay may be modelled by simply adding the delay to the
global memory cycle time.

Assumption 2: For all memory modules, local as well as global, the access
time is equal to the cycle time and the rewrite time is zero (see Figure
6(b)).

Assumption 3: Any particular processor can access the local memory modules
on its own bus as well as all the global memory modules. These accesses
have the following distribution: each local memory module is accessed
with probability $1/(n_{ml} + 1)$, and each global memory module is accessed with
probability $1/(n_{mg}(n_{ml} + 1))$. There are two reasons for assuming this
distribution. First, in any computer system where memory is divided into
local and global parts, accesses from local memory are generally more
frequent than those from global memory. The distribution we have assumed
conforms to this pattern: the total probability assigned to accesses from
all the global memory modules is equal to the probability assigned to
accesses from one local memory module. The second reason is that, as will
be clear in the next section, this distribution will permit us to use
unchanged the bus model developed in Chapter 4.
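
The access distribution of the last assumption can be checked numerically. The sketch below (function name hypothetical; the parameter values are simply an example with 2 local and 4 global modules) confirms that the probabilities seen by one processor sum to one, and that all the global modules together carry the same probability as a single local module.

```python
def access_probabilities(n_ml, n_mg):
    """Per-module access probabilities for one processor.

    Returns (probability of each local module, probability of each global module).
    """
    p_local = 1.0 / (n_ml + 1)
    p_global = 1.0 / (n_mg * (n_ml + 1))
    return p_local, p_global


n_ml, n_mg = 2, 4   # example values
p_local, p_global = access_probabilities(n_ml, n_mg)

total = n_ml * p_local + n_mg * p_global   # should be 1
all_global = n_mg * p_global               # should equal p_local
print(p_local, p_global, total, all_global)
```

Because the global modules jointly behave like one more equally likely module, each processor bus sees $n_{ml} + 1$ uniformly accessed modules, which is what allows the bus model to be reused unchanged.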


5.3 Analysis of the Model

We now give a method for analysing the performance of this model of
the Pluribus system. We have seen in Chapter 3 that there are a number of
models for analysing the crossbar switch system. We have also developed,
in Chapter 4, an analytic model for the time-shared bus system. Since the
Pluribus is a composite of a crossbar switch and time-shared buses, we
shall try to decompose it into these components and use the models for
these to analyse the performance of the whole. However, the bus and the
crossbar subsystems are not isolated; there is an interaction between them
and our model must take this into account. This is done in the following
manner.

Consider a processor bus; let us call it B1. This bus interacts
with the rest of the system (namely, the remaining processor buses, the
crossbar switch, and the global memory modules) by sending requests for
accessing the global memories. These requests are serviced after a certain
time, which depends on the global memory cycle time and the interference
due to requests from other processor buses. Thus, as far as bus B1 is
concerned, the rest of the system simply looks and acts like a memory
module with a certain cycle time, say $t_{cv}$. This is true for each of the
processor buses. Hence the Pluribus system may be represented as shown
in Figure 11. The system now consists of $n_{pb}$ independent processor buses,
each with $n_p$ processors with processing time $t_p$, $n_{ml}$ local memories with
cycle time $t_{cl}$, and a 'virtual' memory module with cycle time $t_{cv}$, which
replaces the rest of the system.

Footnote: The term 'virtual' memory used here denotes the fact that this is
not a real memory module and should not be confused with its conventional
meaning in computer systems architecture.

FIGURE 11: Pluribus system with 'virtual' memory modules (MV)
replacing the crossbar switch component.

The problem is to determine the average value of this cycle time $t_{cv}$
of each of these 'virtual' memories. To do this, let us look at the system
from the viewpoint of the crossbar switch. As far as the crossbar switch
is concerned, each processor bus simply acts as a processor. It sends a
request for accessing data from global memory, and after the request is
satisfied, takes a certain average processing time, say $t_{pv}$, before sending
another request to the global memories.

FIGURE 12: Crossbar component of the Pluribus system
with 'virtual' processors (PV) replacing
each processor bus.

FIGURE 13: Interaction between the bus and crossbar components

In Figure 13, let $d(t_{pv}, t_{cv})$ be the rate at which the 'virtual'
processor executes instructions from the 'virtual' memory. Clearly,

$$b(n_p, n_{ml}, t_p, t_{cl}, t_b, t_{cv}) = c(n_{pb}, n_{mg}, t_{cg}, t_{pv}) = d(t_{pv}, t_{cv}) \qquad (5.1)$$

where b is the rate at which a processor bus executes instructions from
its 'virtual' memory (Figure 11) and c is the rate at which a 'virtual'
processor executes instructions from the global memories (Figure 12).

If the functions b, c, and d are known, then we have a set of two
equations with two unknowns, $t_{pv}$ and $t_{cv}$. These can be solved for, and the
performance of the total system is then found as follows.

As we noticed at the end of Section 4.4 of Chapter 4, for each
processor bus, the rate of instruction execution from all memory modules
(including the 'virtual' memory) is the same; it is thus given by
$d(t_{pv}, t_{cv})$. Hence the total unit instruction execution rate for each
bus is

$$(n_{ml} + 1) \cdot d(t_{pv}, t_{cv}).$$

Since there are $n_{pb}$ buses in all, the total UER for the Pluribus system is

$$UER = n_{pb} \, (n_{ml} + 1) \, d(t_{pv}, t_{cv}) \qquad (5.2)$$

It now remains to solve the system of Equations (5.1). In order to do
this, we must know the functions b, c, and d. The function d is simply given
by

$$d(t_{pv}, t_{cv}) = 1/(t_{pv} + t_{cv}) \qquad (5.3)$$
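
The total UER is simple to evaluate by hand, but a short sketch makes the units explicit. It assumes $d = 1/(t_{pv} + t_{cv})$ and $UER = n_{pb}(n_{ml} + 1)\,d$ as stated above. The function names are hypothetical; the numerical inputs are the converged values of the first case in Table IV, and the factor of 1000 converts instructions per nsec into instructions per microsecond.

```python
def d(t_pv, t_cv):
    """Rate at which a 'virtual' processor executes instructions from the
    'virtual' memory, in instructions per nsec (times given in nsec)."""
    return 1.0 / (t_pv + t_cv)


def total_uer(n_pb, n_ml, t_pv, t_cv):
    """Total unit execution rate: n_pb buses, each with n_ml local modules
    plus one 'virtual' module, every module served at the same rate d."""
    return n_pb * (n_ml + 1) * d(t_pv, t_cv)


# Converged values for n_pb = 7, n_ml = 2 (first case of Table IV), in nsec.
uer_per_nsec = total_uer(n_pb=7, n_ml=2, t_pv=3033, t_cv=1615)
uer_per_usec = 1000.0 * uer_per_nsec
print(round(uer_per_usec, 3))
```

The result agrees with the total UER tabulated for that case to within the rounding of $t_{pv}$ and $t_{cv}$ to whole nanoseconds.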

For the crossbar switch, we may use Strecker's model [Str 70] described
in Chapter 3. The function c is then given by Equation (3.1); this equation
is repeated here:

$$c(n_{pb}, n_{mg}, t_{cg}, t_{pv}) = \frac{n_{mg}}{n_{pb}\, t_{cg}} \left(1 - (1 - P/n_{mg})^{n_{pb}}\right)$$

where P is the root of

$$1 - \frac{n_{pb}\, t_{cg}}{n_{mg}\, t_{pv}} (1 - P) - (1 - P/n_{mg})^{n_{pb}} = 0 \qquad (5.4)$$

The value of this function can be computed for any given argument values
by solving the above equation.
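
A possible implementation of the function c is sketched below. It assumes that the root condition can be written as 1 - (n_pb t_cg / (n_mg t_pv))(1 - P) - (1 - P/n_mg)^n_pb = 0, and that c is then (n_mg / (n_pb t_cg))(1 - (1 - P/n_mg)^n_pb), a rate per 'virtual' processor. This form is a reading of the equation that agrees numerically with the sample outputs of Table IV; it should not be taken as a transcription of Strecker's original expression. The function name `strecker_c` and the tolerance are hypothetical.

```python
def strecker_c(n_pb, n_mg, t_cg, t_pv, tol=1e-12):
    """Rate per 'virtual' processor (instructions per nsec, times in nsec),
    obtained by bisection on the assumed root condition for P in [0, 1]."""
    k = n_pb * t_cg / (n_mg * t_pv)

    def f(p):
        # Assumed form of the root condition; increasing in p on [0, 1].
        return 1.0 - k * (1.0 - p) - (1.0 - p / n_mg) ** n_pb

    lo, hi = 0.0, 1.0          # f(0) = -k < 0 and f(1) > 0, so a root is bracketed
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    p = 0.5 * (lo + hi)
    return (n_mg / (n_pb * t_cg)) * (1.0 - (1.0 - p / n_mg) ** n_pb)


# First iteration of the first case in Table IV: n_pb = 7, n_mg = 4,
# t_cg = 1250 nsec, t_pv = 3166 nsec.
c_case1 = 1000.0 * strecker_c(7, 4, 1250.0, 3166.0)   # instructions per microsec
# First iteration of the single-global-memory case: n_mg = 1, t_cg = 1000 nsec.
c_case6 = 1000.0 * strecker_c(7, 1, 1000.0, 3264.0)
print(round(c_case1, 4), round(c_case6, 4))
```

Bisection is used here because the text reports that the root is known to lie between 0 and 1; since the assumed function changes sign exactly once on that interval, the bracket never has to be widened.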

However, the function b is not known analytically. It can be computed
for any given argument values by using the model developed in Chapter 4
(with $p = n_p$, $n_1 = n_{ml}$, $n_2 = 1$, $t_p = t_p$, $t_{c1} = t_{cl}$, $t_{c2} = t_{cv}$,
and $t_b = t_b$). The value of b is simply $UER_2$.

Keeping in view the consideration that not all the functions are known
analytically, it is possible to solve Equations (5.1) by the following
iterative method.

Choose an initial value for $t_{cv}$; we shall choose $t_{cv} = t_{cg}$.

Solve for $t_{pv}$ from the equation

$$b(n_p, n_{ml}, t_p, t_{cl}, t_b, t_{cv}) = d(t_{pv}, t_{cv}) = 1/(t_{pv} + t_{cv}),$$

i.e.,

$$t_{pv} = \frac{1}{b} - t_{cv}.$$

Using this value of $t_{pv}$, solve for $t_{cv}$ from the equation

$$c(n_{pb}, n_{mg}, t_{cg}, t_{pv}) = d(t_{pv}, t_{cv}) = 1/(t_{pv} + t_{cv}),$$

i.e.,

$$t_{cv} = \frac{1}{c} - t_{pv}.$$

One iteration is now completed. Use this value of $t_{cv}$ to begin the
next iteration. Repeat this process until the value of $t_{cv}$ calculated at
the end of an iteration is equal to the value of $t_{cv}$ with which the
iteration was begun.
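
The iterative method can be sketched as a fixed-point loop on $t_{cv}$. Because the function b requires the Markov chain model of Chapter 4, which is not reproduced here, the sketch takes b and c as callables and exercises the loop with two toy stand-in functions. These stand-ins, the function name `solve_pluribus`, and the stopping tolerance of 0.5 nsec (used in place of exact equality) are all assumptions made for illustration.

```python
def solve_pluribus(b, c, t_cg, tol=0.5, max_iter=100):
    """Fixed-point iteration on t_cv.

    b(t_cv) and c(t_pv) return rates in instructions per nsec; all other
    model parameters are assumed to be bound inside the two callables.
    Returns (t_pv, t_cv, number of iterations used).
    """
    t_cv = t_cg                              # initial value suggested in the text
    for iteration in range(1, max_iter + 1):
        t_pv = 1.0 / b(t_cv) - t_cv          # from b = d = 1/(t_pv + t_cv)
        t_cv_new = 1.0 / c(t_pv) - t_pv      # from c = d = 1/(t_pv + t_cv)
        if abs(t_cv_new - t_cv) < tol:       # t_cv reproduced itself: converged
            return t_pv, t_cv_new, iteration
        t_cv = t_cv_new
    raise RuntimeError("iteration did not converge")


# Toy stand-ins (NOT the bus or crossbar models of the thesis), chosen so that
# the fixed point can be verified by hand: t_cv = 1700 + 0.04 * t_cv.
def toy_b(t_cv):
    return 1.0 / (3000.0 + 1.1 * t_cv)


def toy_c(t_pv):
    return 1.0 / (500.0 + 1.4 * t_pv)


t_pv, t_cv, n_iter = solve_pluribus(toy_b, toy_c, t_cg=1250.0)
print(round(t_pv, 1), round(t_cv, 1), n_iter)
```

With real b and c functions substituted for the stand-ins, each pass of the loop corresponds to one row of Table IV.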

The convergence of this process will, of course, depend on the
functions b and c. These functions are not known analytically, so it is
difficult to say anything definite about the convergence. Since the proof of
the pudding is in its eating, we wrote a computer program to implement the
complete method described in this section. The function b was calculated
by solving numerically the set of simultaneous linear equations (4.1),
(4.2). The function c was computed by numerically solving Equation (5.4)
using the bisection method (it is known that the root lies between 0 and 1).
In the more than four hundred cases on which we tried this method, it never
took more than 8 iterations, sometimes taking only 2 iterations. The
FORTRAN IV program took about 2 minutes compilation time and execution time
of the order of 2 seconds for each analysis on an IBM 7044. A few sample
outputs are shown in Table IV.

TABLE IV

SAMPLE OUTPUTS OF ITERATIVE METHOD FOR ANALYSING
PLURIBUS SYSTEM

(all times in nsec; b, c, and total UER in insts/μsec)

No. of       t_cv       b        t_pv       c        t_cv     Total
iteration                                                      UER

n_p = 2, n_ml = 2, n_mg = 4, n_pb = 7, t_p = 1425, t_cl = 850, t_cg = 1250, t_b = 200
    1        1250     0.2265     3166     0.2098     1602
    2        1602     0.2155     3038     0.2149     1614
    3        1614     0.2152     3033     0.2151     1615
    4        1615     0.2151     3033     0.2151     1615     4.5180

n_p = 2, n_ml = 2, n_mg = 2, n_pb = 7, t_p = 1425, t_cl = 400, t_cg = 850, t_b = 200
    1         850     0.2724     2821     0.2391     1361
    2        1361     0.2505     2631     0.2479     1403
    3        1403     0.2488     2616     0.2486     1407
    4        1407     0.2487     2615     0.2486     1407     5.2217

n_p = 2, n_ml = 2, n_mg = 4, n_pb = 5, t_p = 1425, t_cl = 850, t_cg = 425, t_b = 200
    1         425     0.2539     3513     0.2523      450
    2         450     0.2530     3502     0.2530      450     3.7957

n_p = 2, n_ml = 2, n_mg = 4, n_pb = 6, t_p = 1425, t_cl = 850, t_cg = 1250, t_b = 400
    1        1250     0.1965     3838     0.1878     1486
    2        1486     0.1912     3745     0.1910     1492
    3        1492     0.1910     3742     0.1911     1492     3.4388

n_p = 2, n_ml = 2, n_mg = 4, n_pb = 10, t_p = 600, t_cl = 850, t_cg = 1250, t_b = 200
    1        1250     0.3059     2019     0.2396     2154
    2        2154     0.2562     1749     0.2498     2254
    3        2254     0.2514     1724     0.2508     2264
    4        2264     0.2509     1721     0.2509     2265
    5        2265     0.2509     1721     0.2509     2265     7.5259

n_p = 2, n_ml = 2, n_mg = 1, n_pb = 7, t_p = 1425, t_cl = 850, t_cg = 1000, t_b = 200
    1        1000     0.2345     3264     0.1422     3769
    2        3769     0.1613     2431     0.1428     4573
    3        4573     0.1463     2263     0.1428     4740
    4        4740     0.1434     2231     0.1428     4771
    5        4771     0.1429     2226     0.1428     4776
    6        4776     0.1428     2225     0.1428     4777
    7        4777     0.1428     2224     0.1428     4778
    8        4778     0.1428     2224     0.1428     4778     2.9990


5.4 Analytical Results

Since the Pluribus model has 8 input parameters, each of which can vary
over a fairly wide range, it is impossible to obtain results covering the
whole parameter space or even a significant part of it. For this reason,
we limited ourselves to investigating the parameter space in the vicinity of
the values of the actual Pluribus system built by BBN. This system is
characterized by the following parameter values: $n_p$ = 2, $n_{ml}$ = 2, $n_{mg}$ = 4,
$n_{pb}$ = 7, $t_p$ = 1425 nsec, $t_{cl}$ = 850 nsec, $t_{cg}$ = 1250 nsec, $t_b$ = 200 nsec.
Note that $n_{mg}$ and $t_{cg}$ are obtained by applying the transformation mentioned
in Section 5.1 (with $n_{mb}$ = 2, $n_{bg}$ = 2, $t_{mb}$ = 200 nsec, $t_{cmg}$ = 850 nsec).
The value for $t_p$ is the processing time per unit instruction (see the
definition of unit instruction in Section 2.5) and is obtained from the
fact that the Lockheed SUE processors used by BBN have a 3.7 μsec add or
load time (we took this to be a typical instruction), which is equivalent
to two unit instructions containing two accesses from memory (with access
time 425 nsec).
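
The arithmetic behind this value of $t_p$ can be written out as follows. The worked step assumes, as the sentence above suggests but does not spell out, that the 3.7 μsec instruction time splits evenly into two unit instructions, each consisting of one processing interval $t_p$ and one memory access of 425 nsec:

$$2\,(t_p + 425) = 3700 \quad\Longrightarrow\quad t_p = \frac{3700}{2} - 425 = 1425 \text{ nsec}.$$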

These results are presented in Figure 14. The parts of this figure
represent various cases of interest and are discussed below.

FIGURE 14: Analytical results for the Pluribus model, parts (a) to (i)
(UER in insts/microsec; all times in nsec).

Figures 14(a) and (b): It is evident from these figures that no significant
change in the performance is effected by changing the processor
speed. For a constant processor speed, however, the performance increases
almost linearly with the number of processor buses. This shows that the
global memories do not constitute a bottleneck in the system. Note that
these figures are for a system with 4 global memories. When the number
of global memories is less, however, this statement does not hold, as we
shall see below.

Figures 14(c) and (d): An observation similar to the above holds for the
bus speed. These figures show that the effect of varying the bus speed
is negligible as compared to the effect of varying the number of buses.

Figure 14(e): This is a very interesting figure. It shows that increasing
the number of global memories from 1 to 4 improves the performance
significantly. Beyond that, however, it is useless to further increase this
number. On the other hand, increasing the global memory speed considerably
improves the performance irrespective of the number of global memories.

Figure 14(f): This figure shows how the performance varies with the local
memory speed for various global memory configurations. Clearly, 2 global
memories with cycle time 850 nsec are better than 4 of 1250 nsec. The
local memory speed also has an impact on the performance.

Figures 14(g) and (h): These figures, together with the observations made
earlier, show that, although faster global and local memories are better,
the predominant effect on the performance is that of the number of
processor buses. Figure 14(i), however, gives a better commentary on the
interaction between these factors.

Figure 14(i): In this figure, the local and global memory speeds were
kept equal. It is seen that for slower memories, with 2 global memories,
the performance tends to saturate when the number of processor buses is
increased. For faster memories, there is hardly any difference between the
curves for 2 and 4 global memories. In this region, there is an almost
linear performance increase with the number of buses.

Since the system built at BBN has 4 global memories, we can interpret
these figures to mean that the system performance can be improved by speeding
up the memories, but a spectacular linear improvement can be achieved by
increasing the number of processor buses. However, increasing the number of
global memories and increasing the processor and bus speeds will not have any
significant impact on the performance. It also seems reasonable to assert
that, just as in crossbar switch systems, the performance of the Pluribus
can be increased without any law of diminishing returns so long as the
processor-memory bandwidths are kept matched.


TABLE V

COMPARISON OF ANALYTIC AND SIMULATION
RESULTS FOR PLURIBUS MODEL

(all times in nsec; UER in insts/μsec)

n_mg  n_pb    t_p   t_cl   t_cg    t_b   Analytic  Simulation  Percentage
                                           UER        UER        Error

  1     7    1425    850    425    200    5.1394     5.2247       1.66
  1     7    1425    850   1000    200    2.9990     3.0273       0.94
  2     4    1425    400    400    200    3.4923     3.5740       2.34
  2     4    1425   1000   1000    200    2.6235     2.7160       3.53
  2     5    1425    500    500    200    4.1347     4.2296       2.30
  2     5    1425   1200   1200    200    2.9519     3.0602       3.67
  2     6    1425    600    600    200    4.6846     4.7943       2.34
  2     6    1425   1400   1400    200    3.1497     3.2194       2.21
  2     7    1425    400    850    200    5.2217     5.2705       0.93
  2     7    1425    500    500    200    5.7337     5.8423       1.89
  2     7    1425    500   1250    200    4.2306     4.1876       1.02
  2     7    1425    850    500    200    5.1852     5.3191       2.58
  2     7    1425    850   1200    200    4.1354     4.1471       0.28
  2     7    1425   1000    850    200    4.5636     4.6654       2.23
  2     7    1425   1200   1200    200    3.9112     3.9723       1.56
  2     7    1425   1200   1250    200    3.8408     3.8837       1.12
  2     8    1425    700    700    200    5.7890     5.8889       1.73
  2     8    1425   1600   1600    200    3.4393     3.3811       1.69
  2     9    1425    800    800    200    5.9993     5.9862       0.22
  2     9    1425   1800   1800    200    3.2232     3.1259       3.02
  2    10    1425    400    400    200    8.5542     8.6715       1.37
  2    10    1425   1000   1000    200    5.5554     5.3506       3.69
  3     7    1425    850    700    200    5.0239     5.1593       2.70
  3     7    1425    850   1400    200    4.2161     4.3191       2.44
  4     4     600    850   1250    200    3.5119     3.7111       5.67
  4     4    1400    850   1250    200    2.6792     2.7807       3.79
  4     4    1425    500    500    200    3.3387     3.4220       2.50
  4     4    1425   1200   1200    200    2.4617     2.5832       4.94
  4     5     800    850   1250    200    4.0230     4.2322       5.20
  4     5    1425    400   1250    200    3.6726     3.7750       2.79
  4     5    1425    600    600    200    3.9717     4.0837       2.82
  4     5    1425    850    425    200    3.7957     3.8960       2.64
  4     5    1425    850   1000    200    3.4497     3.5675       3.41
  4     5    1425    850   1250    100    3.5210     3.6529       3.75
  4     5    1425    850   1250    250    3.1861     3.2998       3.57
  4     5    1425   1000   1250    200    3.1783     3.3010       3.86
  4     5    1425   1400   1400    200    2.8291     2.9790       5.30
  4     5    1600    850   1250    200    3.1291     3.2344       3.37
  4     6    1000    850   1250    200    4.4532     4.6649       4.75
  4     6    1425    500   1250    200    4.2484     4.3916       3.37
  4     6    1425    700    700    200    4.5342     4.6714       3.03
  4     6    1425    850    500    200    4.4969     4.6144       2.61
  4     6    1425    850   1200    200    3.9549     4.0847       3.28
  4     6    1425    850   1250    150    4.0436     4.2021       3.92
  4     6    1425    850   1250    300    3.6685     3.7934       3.40
  4     6    1425   1200   1250    200    3.6110     3.7622       4.19
  4     6    1425   1600   1600    200    3.1233     3.2740       4.83
  4     6    1800    850   1250    200    3.5238     3.6292       2.99
  4     7    1200    850   1250    200    4.8198     5.0014       3.77
  4     7    1425    600    600    200    5.5335     5.6743       2.54
  4     7    1425    600    850    200    5.2525     5.3717       2.27
  4     7    1425    700   1250    200    4.6734     4.8314       3.38
  4     7    1425    850    700    200    5.0582     5.2028       2.86
  4     7    1425    850    850    200    4.9162     5.0812       3.36
  4     7    1425    850   1250    200    4.5180     4.6748       3.47
  4     7    1425    850   1250    350    4.1100     4.2252       2.80
  4     7    1425    850   1400    200    4.3653     4.5071       3.25
  4     7    1425    850   1600    200    4.1623     4.3073       3.48
  4     7    1425   1400    850    200    4.2730     4.4556       4.27
  4     7    1425   1400   1400    200    3.8893     4.0834       4.99
  4     7    1425   1600   1250    200    3.8272     3.9749       3.86
  4     7    2000    850   1250    200    3.8727     3.9734       2.60
  4     8     600    850   1250    200    6.4194     6.5819       2.53
  4     8    1400    850   1250    200    5.1364     5.3035       3.25
  4     8    1425    600   1250    200    5.3841     5.5233       2.59
  4     8    1425    800    800    200    5.7238     5.9070       3.20
  4     8    1425    850    850    200    5.5904     5.7609       3.05
  4     8    1425    850   1250    250    4.9489     5.0800       2.65
  4     8    1425    850   1250    400    4.5145     4.6487       2.97
  4     8    1425    850   1600    200    4.6600     4.8125       3.27
  4     8    1425   1400   1250    200    4.5303     4.7251       4.30
  4     8    1425   1800   1800    200    3.7738     3.9344       4.26
  4     9     800    850   1250    200    6.6645     6.8668       3.04
  4     9    1425    400    400    200    7.8388     8.0010       2.07
  4     9    1425    700   1250    200    5.8394     5.9483       1.86
  4     9    1425    850    425    200    6.8054     6.9755       2.50
  4     9    1425    850   1000    200    6.0392     6.1993       2.65
  4     9    1425    850   1250    100    5.9905     6.1348       2.41
  4     9    1425    850   1250    300    5.3408     5.4739       2.49
  4     9    1425   1000   1000    200    5.8283     6.0256       3.39
  4     9    1425   1600   1250    200    4.8468     5.0061       3.29
  4     9    1600    850   1250    200    5.4129     5.5169       1.92
  4    10    1000    850   1250    200    6.8713     7.0482       2.57
  4    10    1425    500    500    200    8.2539     8.4320       2.16
  4    10    1425    800   1250    200    6.2566     6.3687       1.79
  4    10    1425    850    500    200    7.4499     7.6367       2.51
  4    10    1425    850   1200    200    6.2886     6.4082       1.90
  4    10    1425    850   1250    150    6.3647     6.4345       1.80
  4    10    1425    850   1250    350    5.6975     5.8412       2.52
  4    10    1425   1200   1200    200    5.8544     5.9969       2.43
  4    10    1425   1800   1250    200    5.1311     5.3574       4.41
  4    10    1800    850   1250    200    5.6562     5.7656       1.93
  4    11    1200    850   1250    200    7.0501     7.0362       0.20
  4    11    2000    850   1250    200    5.8728     5.9830       1.89
  4    12     600    850   1250    200    8.3477     8.2118       1.63
  4    12    1400    850   1250    200    7.2054     7.2179       0.17
  5     7    1425    850    425    200    5.3104     5.4478       2.59
  5     7    1425    850   1000    200    4.8073     4.9641       3.26
  6     7    1425    850    500    200    5.2536     5.3944       2.68
  6     7    1425    850   1200    200    4.6571     4.8268       3.64

5.5 Simulation Results

Simulation studies were conducted to validate the model for the Pluribus
system presented in this chapter. The simulation program was written in
FORTRAN IV and run on an IBM 7044. The program made only the three
assumptions listed in Section 5.2; it took constant processing times as well as
constant memory cycle times. The instruction execution rate was averaged
over a total of 5000 memory cycles. This amounted to the processing of
anywhere between 5000 and 32000 instructions (approximately) by the
multiprocessor system, depending on the system parameters.

The simulation results together with the analytic results for some
representative cases are shown in Table V. It can be seen that the errors
are all below 6 percent. This verifies the accuracy and usefulness of the
analytic Pluribus model. It also shows that the approximations made in
replacing parts of the system by 'virtual' memories and 'virtual' processors
(see Section 5.3) are reasonable and do not have a significant impact on the
accuracy of the model.
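
The percentage error column of Table V can be reproduced with a few lines of code. The thesis does not state the formula; the sketch below assumes that the error is the absolute difference between the simulation and analytic UER, taken relative to the analytic value, which is consistent with the tabulated figures. The function name is hypothetical and the three sample rows are taken from Table V.

```python
def percentage_error(analytic, simulated):
    """Absolute difference relative to the analytic value, in percent
    (assumed convention; the formula is not given in the text)."""
    return abs(simulated - analytic) / analytic * 100.0


# (analytic UER, simulation UER) in insts/microsec, three rows of Table V
rows = [(5.1394, 5.2247), (2.9990, 3.0273), (3.5119, 3.7111)]

errors = [percentage_error(a, s) for a, s in rows]
for (a, s), e in zip(rows, errors):
    print(f"{a:.4f}  {s:.4f}  {e:.2f}")
```

Under this convention the largest of the three sample errors stays below the 6 percent bound quoted above.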



CHAPTER 6


CONCLUSIONS

We shall now summarize the main results obtained in this thesis and
suggest directions that future work in this area may take.

In Chapter 3, we discussed a model for crossbar switch systems which
takes into account the local referencing property that characterizes most
computer programs. It was found that the performance of multiprocessor
systems with the local reference model is worse than that predicted by the
traditional uniform reference model. We also derived new expressions for
the uniform reference model and compared them with expressions available
in the literature. Our expressions were more accurate than the existing
expressions in most cases.

In Chapter 4, we presented a Markov chain model for multiprocessor
systems using a time-shared bus. For reasons mentioned in Chapter 2, a
single time-shared bus is not generally used in large multiprocessor systems.
However, it is useful to have such models, not only because they help us in
analysing other systems such as the Pluribus, but also because they aid in
the understanding of evaluation techniques for multiprocessors.

In Chapter 5, we presented an analytic model for the Pluribus system.
This model decomposes the Pluribus into its components comprising the
crossbar switch and the processor buses. The performance of the total system is
calculated in an iterative way from the performance of the two different
components. In presenting our results, we used the model for the
time-shared bus developed earlier and an expression derived by Strecker for the
crossbar switch. We would now like to point out that the iterative method
used in the model is independent of the models or expressions that may be
used for calculating the performances of the component systems. If, as
seems plausible, better methods are available in future to analyse the
time-shared bus and crossbar switch systems, these may be profitably used
in this Pluribus model to give better prediction of the system performance.
Further, if analytic expressions are found for both the sub-systems, it
may even become possible to express the Pluribus performance analytically
by solving Equation (5.1).

However, the absence of an analytical expression for the Pluribus
performance in no way detracts from the usefulness of the model. The program
written to implement the analysis method is quite simple and allows a system
designer to quickly evaluate a large design space in an efficient way. This
should help in evaluating the performance of systems based on the Pluribus
architecture, and in pointing out where the bottlenecks lie and what methods
to use for improving the system performance.

It is also our hope that our efforts will spark an interest in devising
better analytic models for multiprocessor systems. The tools available in
this area are still pitifully few and a lot of work is needed to catch up
with the rapid growth in multiprocessor technology. Work on performance
evaluation has so far been limited only to crossbar switch systems. We
have now given analytic models for time-shared bus and Pluribus systems.
These models themselves stand in need of improvement. In addition,
it is necessary to work on models for multiport memory/multibus systems
and for other unconventional systems which are being designed these days.

With the advent of microprocessors on the computer scene, it is now
becoming feasible to construct multiprocessor systems containing a large
number of processors (typically hundreds of microprocessors), which represents
an increase of an order of magnitude in the number of functional units
connected to the system. Clearly, the interconnection mechanism used in
these systems is of crucial importance and work is in progress to devise
new and better interconnection structures [Bar 75, Swa 76]. To keep pace
with these developments, the tools of analytic modelling must be improved
so that the system designer can easily evaluate and compare the various
choices he faces. We hope that this thesis has been one step forward towards
these goals.
APPENDIX

TRANSITION MATRIX FOR THE BUS MODEL EXAMPLE

For the bus model discussed in Section 4.4, with $p = 2$, $n_1 = 2$, $n_2 = 1$, the transition matrix is:
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